mbox series

[RFC,v6,00/20] ceph+fscrypt: context, filename and symlink support

Message ID 20210413175052.163865-1-jlayton@kernel.org (mailing list archive)
Headers show
Series ceph+fscrypt: context, filename and symlink support | expand

Message

Jeff Layton April 13, 2021, 5:50 p.m. UTC
The main change in this posting is in the detection of fscrypted inodes.
The older set would grovel around in the xattr blob to see if it had an
"encryption.ctx" xattr. This was problematic if the MDS didn't send
xattrs in the trace, and not very efficient.

This posting changes it to use the new "fscrypt" flag, which should
always be reported by the MDS (Luis, I'm hoping this may fix the issues
you were seeing with dcache coherency).

This unfortunately requires an MDS fix, but that should hopefully make
it in and be backported to Pacific fairly soon:

    https://github.com/ceph/ceph/pull/40828

We also now handle get_name in the NFS export code correctly.

Aside from that, there are better changelogs, particularly on the
fscrypt and vfs patches, and some smaller bugfixes and optimizations.

Jeff Layton (20):
  vfs: export new_inode_pseudo
  fscrypt: export fscrypt_base64_encode and fscrypt_base64_decode
  fscrypt: export fscrypt_fname_encrypt and fscrypt_fname_encrypted_size
  fscrypt: add fscrypt_context_for_new_inode
  ceph: crypto context handling for ceph
  ceph: implement -o test_dummy_encryption mount option
  ceph: preallocate inode for ops that may create one
  ceph: add routine to create fscrypt context prior to RPC
  ceph: make ceph_msdc_build_path use ref-walk
  ceph: add encrypted fname handling to ceph_mdsc_build_path
  ceph: decode alternate_name in lease info
  ceph: send altname in MClientRequest
  ceph: properly set DCACHE_NOKEY_NAME flag in lookup
  ceph: make d_revalidate call fscrypt revalidator for encrypted
    dentries
  ceph: add helpers for converting names for userland presentation
  ceph: add fscrypt support to ceph_fill_trace
  ceph: add support to readdir for encrypted filenames
  ceph: create symlinks with encrypted and base64-encoded targets
  ceph: make ceph_get_name decrypt filenames
  ceph: add fscrypt ioctls

 fs/ceph/Makefile            |   1 +
 fs/ceph/crypto.c            | 185 ++++++++++++++++++++++
 fs/ceph/crypto.h            | 101 ++++++++++++
 fs/ceph/dir.c               | 178 ++++++++++++++++-----
 fs/ceph/export.c            |  42 +++--
 fs/ceph/file.c              |  58 ++++---
 fs/ceph/inode.c             | 248 ++++++++++++++++++++++++++---
 fs/ceph/ioctl.c             |  93 +++++++++++
 fs/ceph/mds_client.c        | 303 ++++++++++++++++++++++++++++++------
 fs/ceph/mds_client.h        |  15 +-
 fs/ceph/super.c             |  80 +++++++++-
 fs/ceph/super.h             |  15 +-
 fs/ceph/xattr.c             |   5 +
 fs/crypto/fname.c           |  53 +++++--
 fs/crypto/fscrypt_private.h |   9 +-
 fs/crypto/hooks.c           |   6 +-
 fs/crypto/policy.c          |  34 +++-
 fs/inode.c                  |   1 +
 include/linux/fscrypt.h     |  10 ++
 19 files changed, 1263 insertions(+), 174 deletions(-)
 create mode 100644 fs/ceph/crypto.c
 create mode 100644 fs/ceph/crypto.h

Comments

Luis Henriques April 19, 2021, 10:30 a.m. UTC | #1
Jeff Layton <jlayton@kernel.org> writes:

> The main change in this posting is in the detection of fscrypted inodes.
> The older set would grovel around in the xattr blob to see if it had an
> "encryption.ctx" xattr. This was problematic if the MDS didn't send
> xattrs in the trace, and not very efficient.
>
> This posting changes it to use the new "fscrypt" flag, which should
> always be reported by the MDS (Luis, I'm hoping this may fix the issues
> you were seeing with dcache coherency).

I just fetched from your updated 'ceph-fscrypt-fnames' branch (which I
assume contains this RFC series) and I'm now seeing the splat bellow.

Cheers,
--
Luis

[  149.508364] ============================================  
[  149.511075] WARNING: possible recursive locking detected  
[  149.513652] 5.12.0-rc4+ #140 Not tainted                                                                                                                                   
[  149.515656] --------------------------------------------                          
[  149.518293] cat/273 is trying to acquire lock:                                      
[  149.520570] ffff88813b3f6070 (&mdsc->mutex){+.+.}-{3:3}, at: ceph_mdsc_submit_request+0x206/0x600 [ceph]
[  149.525497]                                                                         
[  149.525497] but task is already holding lock:                                
[  149.528420] ffff88813b3f6070 (&mdsc->mutex){+.+.}-{3:3}, at: ceph_mdsc_submit_request+0x206/0x600 [ceph]
[  149.533163]                                                                         
[  149.533163] other info that might help us debug this:
[  149.536383]  Possible unsafe locking scenario:
[  149.536383] 
[  149.539344]        CPU0
[  149.540588]        ----
[  149.541870]   lock(&mdsc->mutex);
[  149.543534]   lock(&mdsc->mutex);
[  149.545205] 
[  149.545205]  *** DEADLOCK ***
[  149.545205] 
[  149.548142]  May be due to missing lock nesting notation
[  149.548142] 
[  149.551254] 2 locks held by cat/273:
[  149.552926]  #0: ffff88812296b590 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x959/0xe10
[  149.556923]  #1: ffff88813b3f6070 (&mdsc->mutex){+.+.}-{3:3}, at: ceph_mdsc_submit_request+0x206/0x600 [ceph]
[  149.560954] 
[  149.560954] stack backtrace:
[  149.562574] CPU: 0 PID: 273 Comm: cat Not tainted 5.12.0-rc4+ #140
[  149.564785] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[  149.567573] Call Trace:
[  149.568207]  dump_stack+0x93/0xc2
[  149.569072]  __lock_acquire.cold+0x2e5/0x30f
[  149.570100]  ? lockdep_hardirqs_on_prepare+0x1f0/0x1f0
[  149.571318]  ? stack_trace_save+0x91/0xc0
[  149.572272]  ? mark_held_locks+0x65/0x90
[  149.573196]  lock_acquire+0x16d/0x4e0
[  149.574030]  ? ceph_mdsc_submit_request+0x206/0x600 [ceph]
[  149.575340]  ? lock_release+0x410/0x410
[  149.576211]  ? lockdep_hardirqs_on_prepare+0x1f0/0x1f0
[  149.577348]  ? mark_lock+0x101/0x1a20
[  149.578145]  ? mark_lock+0x101/0x1a20
[  149.578948]  __mutex_lock+0xfd/0xb80
[  149.579773]  ? ceph_mdsc_submit_request+0x206/0x600 [ceph]
[  149.581031]  ? ceph_mdsc_submit_request+0x206/0x600 [ceph]
[  149.582174]  ? ceph_get_cap_refs+0x1c/0x40 [ceph]
[  149.583171]  ? mutex_lock_io_nested+0xab0/0xab0
[  149.584079]  ? lock_release+0x1ea/0x410
[  149.584869]  ? ceph_mdsc_submit_request+0x42/0x600 [ceph]
[  149.585962]  ? lock_downgrade+0x390/0x390
[  149.586737]  ? lock_is_held_type+0x98/0x110
[  149.587565]  ? ceph_take_cap_refs+0x43/0x220 [ceph]
[  149.588560]  ceph_mdsc_submit_request+0x206/0x600 [ceph]
[  149.589603]  ceph_mdsc_do_request+0x31/0x320 [ceph]
[  149.590554]  __ceph_do_getattr+0xf9/0x2b0 [ceph]
[  149.591453]  __ceph_getxattr+0x2fa/0x480 [ceph]
[  149.592337]  ? find_held_lock+0x85/0xa0
[  149.593055]  ? lock_is_held_type+0x98/0x110
[  149.593799]  ceph_crypt_get_context+0x17/0x20 [ceph]
[  149.594732]  fscrypt_get_encryption_info+0x133/0x220
[  149.595621]  ? fscrypt_prepare_new_inode+0x160/0x160
[  149.596512]  ? dget_parent+0x95/0x2f0
[  149.597166]  ? lock_downgrade+0x390/0x390
[  149.597850]  ? rwlock_bug.part.0+0x60/0x60
[  149.598567]  ? lock_downgrade+0x390/0x390
[  149.599251]  ? do_raw_spin_unlock+0x93/0xf0
[  149.599968]  ? dget_parent+0xc4/0x2f0
[  149.600604]  ceph_mdsc_build_path.part.0+0x367/0x7c0 [ceph]
[  149.601587]  ? remove_session_caps_cb+0x7b0/0x7b0 [ceph]
[  149.602506]  ? __lock_acquire+0x863/0x3070
[  149.603188]  ? lockdep_hardirqs_on_prepare+0x1f0/0x1f0
[  149.604030]  ? __is_insn_slot_addr+0xc9/0x140
[  149.604774]  ? mark_lock+0x101/0x1a20
[  149.605365]  ? lock_is_held_type+0x98/0x110
[  149.606040]  ? find_held_lock+0x85/0xa0
[  149.606660]  ? lock_release+0x1ea/0x410
[  149.607279]  ? set_request_path_attr+0x173/0x500 [ceph]
[  149.608174]  ? lock_downgrade+0x390/0x390
[  149.608825]  ? find_held_lock+0x85/0xa0
[  149.609443]  ? lockdep_hardirqs_on_prepare+0x1f0/0x1f0
[  149.610267]  ? lock_release+0x1ea/0x410
[  149.610924]  set_request_path_attr+0x1a5/0x500 [ceph]
[  149.611811]  __prepare_send_request+0x30e/0x13c0 [ceph]
[  149.612847]  ? rwlock_bug.part.0+0x60/0x60
[  149.613551]  ? set_request_path_attr+0x500/0x500 [ceph]
[  149.614540]  ? __choose_mds+0x323/0xcb0 [ceph]
[  149.615398]  ? trim_caps_cb+0x3b0/0x3b0 [ceph]
[  149.616215]  ? rwlock_bug.part.0+0x60/0x60
[  149.616953]  ? ceph_get_mds_session+0xad/0x1e0 [ceph]
[  149.617847]  ? ceph_session_state_name+0x30/0x30 [ceph]
[  149.618788]  ? ceph_reserve_caps+0x331/0x5a0 [ceph]
[  149.619626]  __do_request+0x338/0x9b0 [ceph]
[  149.620376]  ? cleanup_session_requests+0x1b0/0x1b0 [ceph]
[  149.621347]  ? lock_is_held_type+0x98/0x110
[  149.622052]  ceph_mdsc_submit_request+0x4af/0x600 [ceph]
[  149.622998]  ceph_mdsc_do_request+0x31/0x320 [ceph]
[  149.623885]  ceph_atomic_open+0x3be/0x1050 [ceph]
[  149.624729]  ? d_alloc_parallel+0x576/0xe50
[  149.625309]  ? ceph_renew_caps+0x270/0x270 [ceph]
[  149.625986]  ? __d_lookup_rcu+0x2e0/0x2e0
[  149.626539]  ? lock_is_held_type+0x98/0x110
[  149.627113]  ? lockdep_hardirqs_on_prepare+0x12e/0x1f0
[  149.627835]  lookup_open.isra.0+0x5d2/0x7f0
[  149.628407]  ? hashlen_string+0xa0/0xa0
[  149.628961]  path_openat+0x457/0xe10
[  149.629468]  ? path_parentat+0xc0/0xc0
[  149.630047]  ? __alloc_pages_slowpath.constprop.0+0x1070/0x1070
[  149.630825]  ? lockdep_hardirqs_on_prepare+0x1f0/0x1f0
[  149.631540]  ? mntput_no_expire+0xe6/0x650
[  149.632080]  ? mark_held_locks+0x24/0x90
[  149.632605]  do_filp_open+0x10b/0x220
[  149.633100]  ? may_open_dev+0x50/0x50
[  149.633577]  ? lock_downgrade+0x390/0x390
[  149.634147]  ? do_raw_spin_lock+0x119/0x1b0
[  149.634785]  ? rwlock_bug.part.0+0x60/0x60
[  149.635423]  ? do_raw_spin_unlock+0x93/0xf0
[  149.636094]  ? _raw_spin_unlock+0x1f/0x30
[  149.636735]  ? alloc_fd+0x150/0x300
[  149.637284]  do_sys_openat2+0x115/0x240
[  149.637887]  ? build_open_flags+0x270/0x270
[  149.638511]  ? __ia32_compat_sys_newlstat+0x30/0x30
[  149.639264]  __x64_sys_openat+0xce/0x140
[  149.639878]  ? __ia32_compat_sys_open+0x120/0x120
[  149.640622]  ? lockdep_hardirqs_on_prepare+0x12e/0x1f0
[  149.641389]  ? syscall_enter_from_user_mode+0x1d/0x50
[  149.642175]  ? trace_hardirqs_on+0x32/0x100
[  149.642835]  do_syscall_64+0x33/0x40
[  149.643395]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  149.644165] RIP: 0033:0x7f6d190daffb
[  149.644705] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 5
[  149.647306] RSP: 002b:00007ffe706cec20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[  149.648294] RAX: ffffffffffffffda RBX: 000055656fd16000 RCX: 00007f6d190daffb
[  149.649200] RDX: 0000000000000000 RSI: 00007ffe706d0eda RDI: 00000000ffffff9c
[  149.650094] RBP: 00007ffe706d0eda R08: 0000000000000000 R09: 0000000000000000
[  149.650983] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  149.651876] R13: 0000000000000002 R14: 00007ffe706cef48 R15: 0000000000020000
Jeff Layton April 19, 2021, 12:23 p.m. UTC | #2
On Mon, 2021-04-19 at 11:30 +0100, Luis Henriques wrote:
> Jeff Layton <jlayton@kernel.org> writes:
> 
> > The main change in this posting is in the detection of fscrypted inodes.
> > The older set would grovel around in the xattr blob to see if it had an
> > "encryption.ctx" xattr. This was problematic if the MDS didn't send
> > xattrs in the trace, and not very efficient.
> > 
> > This posting changes it to use the new "fscrypt" flag, which should
> > always be reported by the MDS (Luis, I'm hoping this may fix the issues
> > you were seeing with dcache coherency).
> 
> I just fetched from your updated 'ceph-fscrypt-fnames' branch (which I
> assume contains this RFC series) and I'm now seeing the splat bellow.
> 
> Cheers,
> --
> Luis
> 
> [  149.508364] ============================================  
> [  149.511075] WARNING: possible recursive locking detected  
> [  149.513652] 5.12.0-rc4+ #140 Not tainted                                                                                                                                   
> [  149.515656] --------------------------------------------                          
> [  149.518293] cat/273 is trying to acquire lock:                                      
> [  149.520570] ffff88813b3f6070 (&mdsc->mutex){+.+.}-{3:3}, at: ceph_mdsc_submit_request+0x206/0x600 [ceph]
> [  149.525497]                                                                         
> [  149.525497] but task is already holding lock:                                
> [  149.528420] ffff88813b3f6070 (&mdsc->mutex){+.+.}-{3:3}, at: ceph_mdsc_submit_request+0x206/0x600 [ceph]
> [  149.533163]                                                                         
> [  149.533163] other info that might help us debug this:
> [  149.536383]  Possible unsafe locking scenario:
> [  149.536383] 
> [  149.539344]        CPU0
> [  149.540588]        ----
> [  149.541870]   lock(&mdsc->mutex);
> [  149.543534]   lock(&mdsc->mutex);
> [  149.545205] 
> [  149.545205]  *** DEADLOCK ***
> [  149.545205] 
> [  149.548142]  May be due to missing lock nesting notation
> [  149.548142] 
> [  149.551254] 2 locks held by cat/273:
> [  149.552926]  #0: ffff88812296b590 (&type->i_mutex_dir_key#7){++++}-{3:3}, at: path_openat+0x959/0xe10
> [  149.556923]  #1: ffff88813b3f6070 (&mdsc->mutex){+.+.}-{3:3}, at: ceph_mdsc_submit_request+0x206/0x600 [ceph]
> [  149.560954] 
> [  149.560954] stack backtrace:
> [  149.562574] CPU: 0 PID: 273 Comm: cat Not tainted 5.12.0-rc4+ #140
> [  149.564785] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
> [  149.567573] Call Trace:
> [  149.568207]  dump_stack+0x93/0xc2
> [  149.569072]  __lock_acquire.cold+0x2e5/0x30f
> [  149.570100]  ? lockdep_hardirqs_on_prepare+0x1f0/0x1f0
> [  149.571318]  ? stack_trace_save+0x91/0xc0
> [  149.572272]  ? mark_held_locks+0x65/0x90
> [  149.573196]  lock_acquire+0x16d/0x4e0
> [  149.574030]  ? ceph_mdsc_submit_request+0x206/0x600 [ceph]
> [  149.575340]  ? lock_release+0x410/0x410
> [  149.576211]  ? lockdep_hardirqs_on_prepare+0x1f0/0x1f0
> [  149.577348]  ? mark_lock+0x101/0x1a20
> [  149.578145]  ? mark_lock+0x101/0x1a20
> [  149.578948]  __mutex_lock+0xfd/0xb80
> [  149.579773]  ? ceph_mdsc_submit_request+0x206/0x600 [ceph]
> [  149.581031]  ? ceph_mdsc_submit_request+0x206/0x600 [ceph]
> [  149.582174]  ? ceph_get_cap_refs+0x1c/0x40 [ceph]
> [  149.583171]  ? mutex_lock_io_nested+0xab0/0xab0
> [  149.584079]  ? lock_release+0x1ea/0x410
> [  149.584869]  ? ceph_mdsc_submit_request+0x42/0x600 [ceph]
> [  149.585962]  ? lock_downgrade+0x390/0x390
> [  149.586737]  ? lock_is_held_type+0x98/0x110
> [  149.587565]  ? ceph_take_cap_refs+0x43/0x220 [ceph]
> [  149.588560]  ceph_mdsc_submit_request+0x206/0x600 [ceph]
> [  149.589603]  ceph_mdsc_do_request+0x31/0x320 [ceph]
> [  149.590554]  __ceph_do_getattr+0xf9/0x2b0 [ceph]
> [  149.591453]  __ceph_getxattr+0x2fa/0x480 [ceph]
> [  149.592337]  ? find_held_lock+0x85/0xa0
> [  149.593055]  ? lock_is_held_type+0x98/0x110
> [  149.593799]  ceph_crypt_get_context+0x17/0x20 [ceph]
> [  149.594732]  fscrypt_get_encryption_info+0x133/0x220
> [  149.595621]  ? fscrypt_prepare_new_inode+0x160/0x160
> [  149.596512]  ? dget_parent+0x95/0x2f0
> [  149.597166]  ? lock_downgrade+0x390/0x390
> [  149.597850]  ? rwlock_bug.part.0+0x60/0x60
> [  149.598567]  ? lock_downgrade+0x390/0x390
> [  149.599251]  ? do_raw_spin_unlock+0x93/0xf0
> [  149.599968]  ? dget_parent+0xc4/0x2f0
> [  149.600604]  ceph_mdsc_build_path.part.0+0x367/0x7c0 [ceph]
> [  149.601587]  ? remove_session_caps_cb+0x7b0/0x7b0 [ceph]
> [  149.602506]  ? __lock_acquire+0x863/0x3070
> [  149.603188]  ? lockdep_hardirqs_on_prepare+0x1f0/0x1f0
> [  149.604030]  ? __is_insn_slot_addr+0xc9/0x140
> [  149.604774]  ? mark_lock+0x101/0x1a20
> [  149.605365]  ? lock_is_held_type+0x98/0x110
> [  149.606040]  ? find_held_lock+0x85/0xa0
> [  149.606660]  ? lock_release+0x1ea/0x410
> [  149.607279]  ? set_request_path_attr+0x173/0x500 [ceph]
> [  149.608174]  ? lock_downgrade+0x390/0x390
> [  149.608825]  ? find_held_lock+0x85/0xa0
> [  149.609443]  ? lockdep_hardirqs_on_prepare+0x1f0/0x1f0
> [  149.610267]  ? lock_release+0x1ea/0x410
> [  149.610924]  set_request_path_attr+0x1a5/0x500 [ceph]
> [  149.611811]  __prepare_send_request+0x30e/0x13c0 [ceph]
> [  149.612847]  ? rwlock_bug.part.0+0x60/0x60
> [  149.613551]  ? set_request_path_attr+0x500/0x500 [ceph]
> [  149.614540]  ? __choose_mds+0x323/0xcb0 [ceph]
> [  149.615398]  ? trim_caps_cb+0x3b0/0x3b0 [ceph]
> [  149.616215]  ? rwlock_bug.part.0+0x60/0x60
> [  149.616953]  ? ceph_get_mds_session+0xad/0x1e0 [ceph]
> [  149.617847]  ? ceph_session_state_name+0x30/0x30 [ceph]
> [  149.618788]  ? ceph_reserve_caps+0x331/0x5a0 [ceph]
> [  149.619626]  __do_request+0x338/0x9b0 [ceph]
> [  149.620376]  ? cleanup_session_requests+0x1b0/0x1b0 [ceph]
> [  149.621347]  ? lock_is_held_type+0x98/0x110
> [  149.622052]  ceph_mdsc_submit_request+0x4af/0x600 [ceph]
> [  149.622998]  ceph_mdsc_do_request+0x31/0x320 [ceph]
> [  149.623885]  ceph_atomic_open+0x3be/0x1050 [ceph]
> [  149.624729]  ? d_alloc_parallel+0x576/0xe50
> [  149.625309]  ? ceph_renew_caps+0x270/0x270 [ceph]
> [  149.625986]  ? __d_lookup_rcu+0x2e0/0x2e0
> [  149.626539]  ? lock_is_held_type+0x98/0x110
> [  149.627113]  ? lockdep_hardirqs_on_prepare+0x12e/0x1f0
> [  149.627835]  lookup_open.isra.0+0x5d2/0x7f0
> [  149.628407]  ? hashlen_string+0xa0/0xa0
> [  149.628961]  path_openat+0x457/0xe10
> [  149.629468]  ? path_parentat+0xc0/0xc0
> [  149.630047]  ? __alloc_pages_slowpath.constprop.0+0x1070/0x1070
> [  149.630825]  ? lockdep_hardirqs_on_prepare+0x1f0/0x1f0
> [  149.631540]  ? mntput_no_expire+0xe6/0x650
> [  149.632080]  ? mark_held_locks+0x24/0x90
> [  149.632605]  do_filp_open+0x10b/0x220
> [  149.633100]  ? may_open_dev+0x50/0x50
> [  149.633577]  ? lock_downgrade+0x390/0x390
> [  149.634147]  ? do_raw_spin_lock+0x119/0x1b0
> [  149.634785]  ? rwlock_bug.part.0+0x60/0x60
> [  149.635423]  ? do_raw_spin_unlock+0x93/0xf0
> [  149.636094]  ? _raw_spin_unlock+0x1f/0x30
> [  149.636735]  ? alloc_fd+0x150/0x300
> [  149.637284]  do_sys_openat2+0x115/0x240
> [  149.637887]  ? build_open_flags+0x270/0x270
> [  149.638511]  ? __ia32_compat_sys_newlstat+0x30/0x30
> [  149.639264]  __x64_sys_openat+0xce/0x140
> [  149.639878]  ? __ia32_compat_sys_open+0x120/0x120
> [  149.640622]  ? lockdep_hardirqs_on_prepare+0x12e/0x1f0
> [  149.641389]  ? syscall_enter_from_user_mode+0x1d/0x50
> [  149.642175]  ? trace_hardirqs_on+0x32/0x100
> [  149.642835]  do_syscall_64+0x33/0x40
> [  149.643395]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  149.644165] RIP: 0033:0x7f6d190daffb
> [  149.644705] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 5
> [  149.647306] RSP: 002b:00007ffe706cec20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
> [  149.648294] RAX: ffffffffffffffda RBX: 000055656fd16000 RCX: 00007f6d190daffb
> [  149.649200] RDX: 0000000000000000 RSI: 00007ffe706d0eda RDI: 00000000ffffff9c
> [  149.650094] RBP: 00007ffe706d0eda R08: 0000000000000000 R09: 0000000000000000
> [  149.650983] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [  149.651876] R13: 0000000000000002 R14: 00007ffe706cef48 R15: 0000000000020000

Ouch. That looks like a real bug, alright.

Basically when building the path, we occasionally need to fetch the
crypto context for parent inodes and such, and that can cause us to
recurse back into __ceph_getxattr and try to issue another RPC to the
MDS.

I'll have to look and see what we can do. Maybe it's safe to drop the
mdsc->mutex while we're building the path? Or maybe this is a good time
to re-think a lot of the really onerous locking in this codepath?

I'm open to suggestions here...
Luis Henriques April 19, 2021, 4:03 p.m. UTC | #3
Jeff Layton <jlayton@kernel.org> writes:

> On Mon, 2021-04-19 at 11:30 +0100, Luis Henriques wrote:
...
> Ouch. That looks like a real bug, alright.
>
> Basically when building the path, we occasionally need to fetch the
> crypto context for parent inodes and such, and that can cause us to
> recurse back into __ceph_getxattr and try to issue another RPC to the
> MDS.
>
> I'll have to look and see what we can do. Maybe it's safe to drop the
> mdsc->mutex while we're building the path? Or maybe this is a good time
> to re-think a lot of the really onerous locking in this codepath?
>
> I'm open to suggestions here...

Yeah, I couldn't see a good fix at a first glace.  Dropping the mutex
while building the path was my initial thought too but it's not easy to
proof that's a safe thing to do.

The other idea I had was to fetch all the needed fscrypt contexts at the
end, after building the path.  But I didn't found a way for doing that
because to build the path... we need the contexts.

It looks like this leaves us with the locking rethinking option.

/me tries harder to find another way out

Cheers,
Jeff Layton April 19, 2021, 4:28 p.m. UTC | #4
On Mon, 2021-04-19 at 17:03 +0100, Luis Henriques wrote:
> Jeff Layton <jlayton@kernel.org> writes:
> 
> > On Mon, 2021-04-19 at 11:30 +0100, Luis Henriques wrote:
> ...
> > Ouch. That looks like a real bug, alright.
> > 
> > Basically when building the path, we occasionally need to fetch the
> > crypto context for parent inodes and such, and that can cause us to
> > recurse back into __ceph_getxattr and try to issue another RPC to the
> > MDS.
> > 
> > I'll have to look and see what we can do. Maybe it's safe to drop the
> > mdsc->mutex while we're building the path? Or maybe this is a good time
> > to re-think a lot of the really onerous locking in this codepath?
> > 
> > I'm open to suggestions here...
> 
> Yeah, I couldn't see a good fix at a first glace.  Dropping the mutex
> while building the path was my initial thought too but it's not easy to
> proof that's a safe thing to do.
> 

Indeed. It's an extremely coarse-grained mutex and not at all clear what
it protects here.

> The other idea I had was to fetch all the needed fscrypt contexts at the
> end, after building the path.  But I didn't found a way for doing that
> because to build the path... we need the contexts.
> 
> It looks like this leaves us with the locking rethinking option.
> 
> /me tries harder to find another way out
> 
> Cheers,

The other option I think is to not store the context in an xattr at all,
and instead make a dedicated field in the inode for it that we can
ensure is always present for encrypted inodes.  For the most part the
crypto context is a static thing. The only exception is when we're first
encrypting an empty dir.

We already have the fscrypt bool in the inodestat, and we're going to
need another field to hold the real size for files. It may be worthwhile
to just reconsider the design at that level. Maybe we just need to carve
out a chunk of fscrypt space in the inode for the client and let it
manage that however it sees fit.
Luis Henriques April 20, 2021, 10:11 a.m. UTC | #5
Jeff Layton <jlayton@kernel.org> writes:

> On Mon, 2021-04-19 at 17:03 +0100, Luis Henriques wrote:
>> Jeff Layton <jlayton@kernel.org> writes:
>> 
>> > On Mon, 2021-04-19 at 11:30 +0100, Luis Henriques wrote:
>> ...
>> > Ouch. That looks like a real bug, alright.
>> > 
>> > Basically when building the path, we occasionally need to fetch the
>> > crypto context for parent inodes and such, and that can cause us to
>> > recurse back into __ceph_getxattr and try to issue another RPC to the
>> > MDS.
>> > 
>> > I'll have to look and see what we can do. Maybe it's safe to drop the
>> > mdsc->mutex while we're building the path? Or maybe this is a good time
>> > to re-think a lot of the really onerous locking in this codepath?
>> > 
>> > I'm open to suggestions here...
>> 
>> Yeah, I couldn't see a good fix at a first glace.  Dropping the mutex
>> while building the path was my initial thought too but it's not easy to
>> proof that's a safe thing to do.
>> 
>
> Indeed. It's an extremely coarse-grained mutex and not at all clear what
> it protects here.
>
>> The other idea I had was to fetch all the needed fscrypt contexts at the
>> end, after building the path.  But I didn't found a way for doing that
>> because to build the path... we need the contexts.
>> 
>> It looks like this leaves us with the locking rethinking option.
>> 
>> /me tries harder to find another way out
>> 
>> Cheers,
>
> The other option I think is to not store the context in an xattr at all,
> and instead make a dedicated field in the inode for it that we can
> ensure is always present for encrypted inodes.  For the most part the
> crypto context is a static thing. The only exception is when we're first
> encrypting an empty dir.
>
> We already have the fscrypt bool in the inodestat, and we're going to
> need another field to hold the real size for files. It may be worthwhile
> to just reconsider the design at that level. Maybe we just need to carve
> out a chunk of fscrypt space in the inode for the client and let it
> manage that however it sees fit.

That's another solution.  Since the initial (naïfe) idea of having a
client-only implementation with fscrypt-agnostic MDSs is long gone, the
design can (still) be fixed to do that.  This will definitely allow to
move forward with the fscrypt implementation.  (But we'll probably be
bitten again with these recursive RPCs in the future!)

Anyway, this is probably the most interesting solution as it also reduces
the need for extra calls to MDS.  And the fscrypt bool in inodestat
probably becomes redundant and can be dropped.

Cheers,
Jeff Layton April 20, 2021, 3:52 p.m. UTC | #6
On Tue, 2021-04-20 at 11:11 +0100, Luis Henriques wrote:
> Jeff Layton <jlayton@kernel.org> writes:
> 
> > On Mon, 2021-04-19 at 17:03 +0100, Luis Henriques wrote:
> > > Jeff Layton <jlayton@kernel.org> writes:
> > > 
> > > > On Mon, 2021-04-19 at 11:30 +0100, Luis Henriques wrote:
> > > ...
> > > > Ouch. That looks like a real bug, alright.
> > > > 
> > > > Basically when building the path, we occasionally need to fetch the
> > > > crypto context for parent inodes and such, and that can cause us to
> > > > recurse back into __ceph_getxattr and try to issue another RPC to the
> > > > MDS.
> > > > 
> > > > I'll have to look and see what we can do. Maybe it's safe to drop the
> > > > mdsc->mutex while we're building the path? Or maybe this is a good time
> > > > to re-think a lot of the really onerous locking in this codepath?
> > > > 
> > > > I'm open to suggestions here...
> > > 
> > > Yeah, I couldn't see a good fix at a first glace.  Dropping the mutex
> > > while building the path was my initial thought too but it's not easy to
> > > proof that's a safe thing to do.
> > > 
> > 
> > Indeed. It's an extremely coarse-grained mutex and not at all clear what
> > it protects here.
> > 
> > > The other idea I had was to fetch all the needed fscrypt contexts at the
> > > end, after building the path.  But I didn't found a way for doing that
> > > because to build the path... we need the contexts.
> > > 
> > > It looks like this leaves us with the locking rethinking option.
> > > 
> > > /me tries harder to find another way out
> > > 
> > > Cheers,
> > 
> > The other option I think is to not store the context in an xattr at all,
> > and instead make a dedicated field in the inode for it that we can
> > ensure is always present for encrypted inodes.  For the most part the
> > crypto context is a static thing. The only exception is when we're first
> > encrypting an empty dir.
> > 
> > We already have the fscrypt bool in the inodestat, and we're going to
> > need another field to hold the real size for files. It may be worthwhile
> > to just reconsider the design at that level. Maybe we just need to carve
> > out a chunk of fscrypt space in the inode for the client and let it
> > manage that however it sees fit.
> 
> That's another solution.  Since the initial (naïfe) idea of having a
> client-only implementation with fscrypt-agnostic MDSs is long gone, the
> design can (still) be fixed to do that.  This will definitely allow to
> move forward with the fscrypt implementation.  (But we'll probably be
> bitten again with these recursive RPCs in the future!)
> 
> Anyway, this is probably the most interesting solution as it also reduces
> the need for extra calls to MDS.  And the fscrypt bool in inodestat
> probably becomes redundant and can be dropped.
> 

We probably can't drop the bool from the protocol, as it's now in a
released version (Pacific).

What we can do is drop tracking the bool internally in the MDS, and just
set that to true if the fscrypt blob isn't zero-length.

Cheers,