Message ID | 37c29c42-7685-d1f0-067d-63582ffac405@huaweicloud.com (mailing list archive) |
---|---|
State | New |
Series | [BUG,REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs |
On Fri, Jul 12, 2024 at 02:27:20PM +0800, Zhihao Cheng wrote:
> Problem description
> ===================
>
> The inode reclaiming process(See function prune_icache_sb) collects all
> reclaimable inodes and marks them with I_FREEING flag at first, at that
> time, other processes will be stuck if they try getting these inodes(See
> function find_inode_fast), then the reclaiming process destroys the
> inodes by function dispose_list().
> Some filesystems(eg. ext4 with ea_inode feature, ubifs with xattr) may
> do inode lookup in the inode evicting callback function, if the inode
> lookup is operated under the inode lru traversing context, deadlock
> problems may happen.
>
> Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
> if ea_inode feature is enabled, the lookup process will be stuck under
> the evicting context like this:
>
> 1. File A has inode i_reg and an ea inode i_ea
> 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
> 3. Then, following three processes running like this:
>
>        PA                                PB
>  echo 2 > /proc/sys/vm/drop_caches
>   shrink_slab
>    prune_dcache_sb
>    // i_reg is added into lru, lru->i_ea->i_reg
>    prune_icache_sb
>     list_lru_walk_one
>      inode_lru_isolate
>       i_ea->i_state |= I_FREEING // set inode state
>       i_ea->i_state |= I_FREEING // set inode state

Um, I don't see how this can happen.  If the ea_inode is in use,
i_count will be greater than zero, and hence the inode will never
go down the rest of the path in inode_lru_isolate():

	if (atomic_read(&inode->i_count) ||
	    ...) {
		list_lru_isolate(lru, &inode->i_lru);
		spin_unlock(&inode->i_lock);
		this_cpu_dec(nr_unused);
		return LRU_REMOVED;
	}

Do you have an actual reproducer which triggers this?  Or would this
happen to be by any chance something that was dreamed up with DEPT?

					- Ted
On 2024/7/12 22:37, Theodore Ts'o wrote:
>> Problem description
>> ===================
>>
>> The inode reclaiming process(See function prune_icache_sb) collects all
>> reclaimable inodes and marks them with I_FREEING flag at first, at that
>> time, other processes will be stuck if they try getting these inodes(See
>> function find_inode_fast), then the reclaiming process destroys the
>> inodes by function dispose_list().
>> Some filesystems(eg. ext4 with ea_inode feature, ubifs with xattr) may
>> do inode lookup in the inode evicting callback function, if the inode
>> lookup is operated under the inode lru traversing context, deadlock
>> problems may happen.
>>
>> Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
>> if ea_inode feature is enabled, the lookup process will be stuck under
>> the evicting context like this:
>>
>> 1. File A has inode i_reg and an ea inode i_ea
>> 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
>> 3. Then, following three processes running like this:
>>
>>        PA                                PB
>>  echo 2 > /proc/sys/vm/drop_caches
>>   shrink_slab
>>    prune_dcache_sb
>>    // i_reg is added into lru, lru->i_ea->i_reg
>>    prune_icache_sb
>>     list_lru_walk_one
>>      inode_lru_isolate
>>       i_ea->i_state |= I_FREEING // set inode state
>>       i_ea->i_state |= I_FREEING // set inode state
> Um, I don't see how this can happen.  If the ea_inode is in use,
> i_count will be greater than zero, and hence the inode will never
> go down the rest of the path in inode_lru_isolate():

The reference count of the ea_inode can become zero before that of the
file inode, according to the following process:

path_getxattr
 user_path_at(&path) // get file dentry and file inode
 getxattr
  ext4_xattr_get
   ext4_xattr_ibody_get
    ext4_xattr_inode_get
     ext4_xattr_inode_iget(&ea_inode) // ea_inode->i_count = 1
     iput(ea_inode) // ea_inode->i_count = 0, put it into lru
 path_put(&path); // put file dentry and file inode

>
> 	if (atomic_read(&inode->i_count) ||
> 	    ...) {
> 		list_lru_isolate(lru, &inode->i_lru);
> 		spin_unlock(&inode->i_lock);
> 		this_cpu_dec(nr_unused);
> 		return LRU_REMOVED;
> 	}
>
> Do you have an actual reproducer which triggers this?  Or would this
> happen to be by any chance something that was dreamed up with DEPT?

The reproducer is in the second half of the article, along with some of
the solutions we tried.

Reproducer:
===========
https://bugzilla.kernel.org/show_bug.cgi?id=219022

About solutions
===============
[...]
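For reference, the call chain above can be driven from userspace with
something like the sketch below. This is not the bugzilla reproducer and it
does not win the race by itself; the file name "A", the xattr name
"user.big" and the buffer size are made up, and it assumes an ext4
filesystem with the ea_inode feature and an xattr value large enough to be
stored in a separate ea_inode. It only illustrates how the ea_inode ends up
on the LRU with i_count == 0 while its parent inode is still cached.

/*
 * Userspace sketch (illustrative only): get a large xattr, then drop caches.
 * Run as root so that writing to /proc/sys/vm/drop_caches succeeds.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/xattr.h>

int main(void)
{
	static char buf[64 * 1024];
	ssize_t len;

	/*
	 * getxattr() -> ext4_xattr_inode_iget() + iput(): the ea_inode is
	 * looked up and released, so it lands on the inode LRU with
	 * i_count == 0 (lru: i_ea).
	 */
	len = getxattr("A", "user.big", buf, sizeof(buf));
	if (len < 0) {
		perror("getxattr");
		return 1;
	}

	/*
	 * Dropping caches now lets prune_dcache_sb() release the dentry of
	 * "A", which adds i_reg behind i_ea on the LRU (lru: i_ea -> i_reg),
	 * the ordering the deadlock analysis relies on.
	 */
	if (system("echo 2 > /proc/sys/vm/drop_caches") != 0)
		return 1;
	return 0;
}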
> Um, I don't see how this can happen.  If the ea_inode is in use,
> i_count will be greater than zero, and hence the inode will never
> go down the rest of the path in inode_lru_isolate():
>
> 	if (atomic_read(&inode->i_count) ||
> 	    ...) {
> 		list_lru_isolate(lru, &inode->i_lru);
> 		spin_unlock(&inode->i_lock);
> 		this_cpu_dec(nr_unused);
> 		return LRU_REMOVED;
> 	}

Yes, in the function inode_lru_isolate (in the case of clearing the cache),
there is such an inode->i_state check mechanism to avoid double-removing an
inode which is being removed by another process. Unluckily, there is no
similar inode->i_state check mechanism in the function iput_final (in the
case of removing a file), so double-removing an inode can still happen.

It looks like we need to add some inode->i_state check in iput_final(), if
we want to fix this race condition bug.
On 2024/7/18 11:04, Ryder Wang wrote:

Hi, Ryder

>> Um, I don't see how this can happen.  If the ea_inode is in use,
>> i_count will be greater than zero, and hence the inode will never
>> go down the rest of the path in inode_lru_isolate():
>>
>> 	if (atomic_read(&inode->i_count) ||
>> 	    ...) {
>> 		list_lru_isolate(lru, &inode->i_lru);
>> 		spin_unlock(&inode->i_lock);
>> 		this_cpu_dec(nr_unused);
>> 		return LRU_REMOVED;
>> 	}
>
> Yes, in the function inode_lru_isolate (in the case of clearing the cache),
> there is such an inode->i_state check mechanism to avoid double-removing an
> inode which is being removed by another process. Unluckily, there is no
> similar inode->i_state check mechanism in the function iput_final (in the
> case of removing a file), so double-removing an inode can still happen.

I'm a little confused about the process of inode double-removing, can you
provide a detailed case about how double-removing happens? I can't find the
relationship between inode double-removing and the problem I described.

>
> It looks like we need to add some inode->i_state check in iput_final(), if
> we want to fix this race condition bug.
On Fri 12-07-24 10:37:08, Theodore Ts'o wrote:
> On Fri, Jul 12, 2024 at 02:27:20PM +0800, Zhihao Cheng wrote:
> > Problem description
> > ===================
> >
> > The inode reclaiming process(See function prune_icache_sb) collects all
> > reclaimable inodes and marks them with I_FREEING flag at first, at that
> > time, other processes will be stuck if they try getting these inodes(See
> > function find_inode_fast), then the reclaiming process destroys the
> > inodes by function dispose_list().
> > Some filesystems(eg. ext4 with ea_inode feature, ubifs with xattr) may
> > do inode lookup in the inode evicting callback function, if the inode
> > lookup is operated under the inode lru traversing context, deadlock
> > problems may happen.
> >
> > Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
> > if ea_inode feature is enabled, the lookup process will be stuck under
> > the evicting context like this:
> >
> > 1. File A has inode i_reg and an ea inode i_ea
> > 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
> > 3. Then, following three processes running like this:
> >
> >        PA                                PB
> >  echo 2 > /proc/sys/vm/drop_caches
> >   shrink_slab
> >    prune_dcache_sb
> >    // i_reg is added into lru, lru->i_ea->i_reg
> >    prune_icache_sb
> >     list_lru_walk_one
> >      inode_lru_isolate
> >       i_ea->i_state |= I_FREEING // set inode state
> >       i_ea->i_state |= I_FREEING // set inode state
>
> Um, I don't see how this can happen.  If the ea_inode is in use,
> i_count will be greater than zero, and hence the inode will never
> go down the rest of the path in inode_lru_isolate():
>
> 	if (atomic_read(&inode->i_count) ||
> 	    ...) {
> 		list_lru_isolate(lru, &inode->i_lru);
> 		spin_unlock(&inode->i_lock);
> 		this_cpu_dec(nr_unused);
> 		return LRU_REMOVED;
> 	}
>
> Do you have an actual reproducer which triggers this?  Or would this
> happen to be by any chance something that was dreamed up with DEPT?

No, it looks like a real problem and I agree with the analysis. We don't
hold ea_inode reference (i.e., ea_inode->i_count) from a normal inode. The
normal inode just owns that special on-disk xattr reference. Standard
inode references are acquired and dropped as needed.

And this is exactly the problem: ext4_xattr_inode_dec_ref_all() called from
evict() needs to lookup the ea_inode and iget() it. So if we are processing
a list of inodes to dispose, all inodes have I_FREEING bit already set and
if ea_inode and its parent normal inode are both in the list, then the
evict()->ext4_xattr_inode_dec_ref_all()->iget() will deadlock.

Normally we don't hit this path because LRU list walk is not handling
inodes with 0 link count. But a race with unlink can make that happen with
iput() from inode_lru_isolate().

I'm pondering about the best way to fix this. Maybe we could handle the
need for inode pinning in inode_lru_isolate() in a similar way as in
writeback code so that last iput() cannot happen from inode_lru_isolate().
In writeback we use I_SYNC flag to pin the inode and evict() waits for this
flag to clear. I'll probably sleep on it and if I don't find it too
disgusting tomorrow, I can code it.

								Honza
Hi, Jan

On 2024/7/18 21:40, Jan Kara wrote:
> On Fri 12-07-24 10:37:08, Theodore Ts'o wrote:
>> On Fri, Jul 12, 2024 at 02:27:20PM +0800, Zhihao Cheng wrote:
>>> Problem description
>>> ===================
>>>
>>> The inode reclaiming process(See function prune_icache_sb) collects all
>>> reclaimable inodes and marks them with I_FREEING flag at first, at that
>>> time, other processes will be stuck if they try getting these inodes(See
>>> function find_inode_fast), then the reclaiming process destroys the
>>> inodes by function dispose_list().
>>> Some filesystems(eg. ext4 with ea_inode feature, ubifs with xattr) may
>>> do inode lookup in the inode evicting callback function, if the inode
>>> lookup is operated under the inode lru traversing context, deadlock
>>> problems may happen.
>>>
>>> Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
>>> if ea_inode feature is enabled, the lookup process will be stuck under
>>> the evicting context like this:
>>>
>>> 1. File A has inode i_reg and an ea inode i_ea
>>> 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
>>> 3. Then, following three processes running like this:
>>>
>>>        PA                                PB
>>>  echo 2 > /proc/sys/vm/drop_caches
>>>   shrink_slab
>>>    prune_dcache_sb
>>>    // i_reg is added into lru, lru->i_ea->i_reg
>>>    prune_icache_sb
>>>     list_lru_walk_one
>>>      inode_lru_isolate
>>>       i_ea->i_state |= I_FREEING // set inode state
>>>       i_ea->i_state |= I_FREEING // set inode state
>>
>> Um, I don't see how this can happen.  If the ea_inode is in use,
>> i_count will be greater than zero, and hence the inode will never
>> go down the rest of the path in inode_lru_isolate():
>>
>> 	if (atomic_read(&inode->i_count) ||
>> 	    ...) {
>> 		list_lru_isolate(lru, &inode->i_lru);
>> 		spin_unlock(&inode->i_lock);
>> 		this_cpu_dec(nr_unused);
>> 		return LRU_REMOVED;
>> 	}
>>
>> Do you have an actual reproducer which triggers this?  Or would this
>> happen to be by any chance something that was dreamed up with DEPT?
>
> No, it looks like a real problem and I agree with the analysis. We don't
> hold ea_inode reference (i.e., ea_inode->i_count) from a normal inode. The
> normal inode just owns that special on-disk xattr reference. Standard
> inode references are acquired and dropped as needed.
>
> And this is exactly the problem: ext4_xattr_inode_dec_ref_all() called from
> evict() needs to lookup the ea_inode and iget() it. So if we are processing
> a list of inodes to dispose, all inodes have I_FREEING bit already set and
> if ea_inode and its parent normal inode are both in the list, then the
> evict()->ext4_xattr_inode_dec_ref_all()->iget() will deadlock.

Yes, absolutely right.

>
> Normally we don't hit this path because LRU list walk is not handling
> inodes with 0 link count. But a race with unlink can make that happen with
> iput() from inode_lru_isolate().

Another reason is that mapping_empty(&inode->i_data) is consistent with
mapping_shrinkable(&inode->i_data) in most cases (CONFIG_HIGHMEM is disabled
by default on 64-bit platforms, so mapping_shrinkable() hardly returns true
if the file inode's mapping has pagecaches); the problem path expects that
mapping_shrinkable() returns true and mapping_empty() returns false. Do we
have any other methods to replace the following if-branch without invoking
__iget()?

	/*
	 * On highmem systems, mapping_shrinkable() permits dropping
	 * page cache in order to free up struct inodes: lowmem might
	 * be under pressure before the cache inside the highmem zone.
	 */
	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
		__iget(inode);
		...
		iput(inode);
		spin_lock(lru_lock);
		return LRU_RETRY;
	}

>
> I'm pondering about the best way to fix this. Maybe we could handle the
> need for inode pinning in inode_lru_isolate() in a similar way as in
> writeback code so that last iput() cannot happen from inode_lru_isolate().
> In writeback we use I_SYNC flag to pin the inode and evict() waits for this
> flag to clear. I'll probably sleep on it and if I don't find it too
> disgusting tomorrow, I can code it.
>

I guess that you may modify it like this:

diff --git a/fs/inode.c b/fs/inode.c
index f356fe2ec2b6..5b1a9b23f53f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -457,7 +457,7 @@ EXPORT_SYMBOL(ihold);

 static void __inode_add_lru(struct inode *inode, bool rotate)
 {
-	if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE))
+	if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PINING))
 		return;
 	if (atomic_read(&inode->i_count))
 		return;
@@ -845,7 +845,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	 * be under pressure before the cache inside the highmem zone.
 	 */
 	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
-		__iget(inode);
+		inode->i_state |= I_PINING;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(lru_lock);
 		if (remove_inode_buffers(inode)) {
@@ -857,7 +857,10 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 			__count_vm_events(PGINODESTEAL, reap);
 			mm_account_reclaimed_pages(reap);
 		}
-		iput(inode);
+		spin_lock(&inode->i_lock);
+		inode->i_state &= ~I_PINING;
+		wake_up_bit(&inode->i_state, __I_PINING);
+		spin_unlock(&inode->i_lock);
 		spin_lock(lru_lock);
 		return LRU_RETRY;
 	}
@@ -1772,6 +1775,7 @@ static void iput_final(struct inode *inode)
 		return;
 	}

+	inode_wait_for_pining(inode);
 	state = inode->i_state;
 	if (!drop) {
 		WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd34b5755c0b..daf094fff5fe 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2415,6 +2415,8 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 #define I_DONTCACHE		(1 << 16)
 #define I_SYNC_QUEUED		(1 << 17)
 #define I_PINNING_NETFS_WB	(1 << 18)
+#define __I_PINING		19
+#define I_PINING		(1 << __I_PINING)

 #define I_DIRTY_INODE	(I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY	(I_DIRTY_INODE | I_DIRTY_PAGES)

This means that we will introduce a new inode state to solve the problem.
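The diff above calls inode_wait_for_pining() without showing it. A minimal
sketch of what such a helper could look like, modeled on the
__inode_wait_for_writeback() wait-on-bit pattern in fs/fs-writeback.c; the
name, placement (fs/inode.c context) and exact form are assumptions, not
part of the posted diff:

/*
 * Sketch only (not from the posted diff): a possible inode_wait_for_pining(),
 * modeled on __inode_wait_for_writeback(). Called with inode->i_lock held;
 * it drops and re-takes the lock while sleeping.
 */
static void inode_wait_for_pining(struct inode *inode)
{
	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_PINING);
	wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_PINING);

	/* Re-check in a loop: nothing prevents a new pin between wakeups. */
	while (inode->i_state & I_PINING) {
		spin_unlock(&inode->i_lock);
		__wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
		spin_lock(&inode->i_lock);
	}
}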
On Fri, Jul 19, 2024 at 11:21:51AM +0800, Zhihao Cheng wrote:
> On 2024/7/18 21:40, Jan Kara wrote:
> > I'm pondering about the best way to fix this. Maybe we could handle the
> > need for inode pinning in inode_lru_isolate() in a similar way as in
> > writeback code so that last iput() cannot happen from inode_lru_isolate().
> > In writeback we use I_SYNC flag to pin the inode and evict() waits for this
> > flag to clear. I'll probably sleep on it and if I don't find it too
> > disgusting tomorrow, I can code it.
> >
>
> I guess that you may modify it like this:
> diff --git a/fs/inode.c b/fs/inode.c
> index f356fe2ec2b6..5b1a9b23f53f 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -457,7 +457,7 @@ EXPORT_SYMBOL(ihold);
>
>  static void __inode_add_lru(struct inode *inode, bool rotate)
>  {
> -	if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE))
> +	if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PINING))
>  		return;
>  	if (atomic_read(&inode->i_count))
>  		return;
> @@ -845,7 +845,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>  	 * be under pressure before the cache inside the highmem zone.
>  	 */
>  	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
> -		__iget(inode);
> +		inode->i_state |= I_PINING;
>  		spin_unlock(&inode->i_lock);
>  		spin_unlock(lru_lock);
>  		if (remove_inode_buffers(inode)) {
> @@ -857,7 +857,10 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>  			__count_vm_events(PGINODESTEAL, reap);
>  			mm_account_reclaimed_pages(reap);
>  		}
> -		iput(inode);
> +		spin_lock(&inode->i_lock);
> +		inode->i_state &= ~I_PINING;
> +		wake_up_bit(&inode->i_state, __I_PINING);
> +		spin_unlock(&inode->i_lock);
>  		spin_lock(lru_lock);
>  		return LRU_RETRY;
>  	}
> @@ -1772,6 +1775,7 @@ static void iput_final(struct inode *inode)
>  		return;
>  	}
>
> +	inode_wait_for_pining(inode);
>  	state = inode->i_state;
>  	if (!drop) {
>  		WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fd34b5755c0b..daf094fff5fe 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2415,6 +2415,8 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
>  #define I_DONTCACHE		(1 << 16)
>  #define I_SYNC_QUEUED		(1 << 17)
>  #define I_PINNING_NETFS_WB	(1 << 18)
> +#define __I_PINING		19
> +#define I_PINING		(1 << __I_PINING)
>
>  #define I_DIRTY_INODE	(I_DIRTY_SYNC | I_DIRTY_DATASYNC)
>  #define I_DIRTY	(I_DIRTY_INODE | I_DIRTY_PAGES)
>
> This means that we will introduce a new inode state to solve the problem.
>

My non-maintainer $0,03 is as follows:

1. I_PINING is too generic of a name. I_LRU_PINNED or something else
   indicating what this is for would be prudent.

2. While not specific to this patch, the handling of i_state is too
   accidental-breakage friendly. A full blown solution is way out of the
   scope here, but something can be done to future-proof this work anyway.
   To that end I would suggest:

   1. inode_lru_pin(), which apart from setting the flag includes:
      BUG_ON(inode->i_state & (I_LRU_PINNED | I_FREEING | I_WILL_FREE))
   2. inode_lru_unpin(), which apart from unsetting the flag + wakeup includes:
      BUG_ON(!(inode->i_state & I_LRU_PINNED))
   3. inode_lru_wait_for_pinned()

However, a non-cosmetic remark is that at the spot inode_wait_for_pining
gets invoked none of the pinning-blocking flags may be set (to my reading
anyway). This is not the end of the world, but it does mean the waiting
routine will have to check stuff in a loop.
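A sketch of the helpers suggested above, under the proposed I_LRU_PINNED
naming. None of this exists in the kernel tree; the bit number, the helper
names and the assumption that inode->i_lock is held by the callers are all
illustrative:

/*
 * Sketch of the suggested helpers (not in the kernel tree); the flag name,
 * bit number and locking assumptions are illustrative.
 */
#define __I_LRU_PINNED	19
#define I_LRU_PINNED	(1 << __I_LRU_PINNED)

static void inode_lru_pin(struct inode *inode)
{
	/* Pinning an inode that is already pinned or being freed is a bug. */
	BUG_ON(inode->i_state & (I_LRU_PINNED | I_FREEING | I_WILL_FREE));
	inode->i_state |= I_LRU_PINNED;
}

static void inode_lru_unpin(struct inode *inode)
{
	/* Unpinning an inode that was never pinned is a bug. */
	BUG_ON(!(inode->i_state & I_LRU_PINNED));
	inode->i_state &= ~I_LRU_PINNED;
	wake_up_bit(&inode->i_state, __I_LRU_PINNED);
}

/*
 * inode_lru_wait_for_pinned() would look like the inode_wait_for_pining()
 * sketch earlier in the thread, except that, as noted above, it has to
 * re-check the flag in a loop because no other i_state flag blocks new pins
 * at the point where iput_final() would call it.
 */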
Names are not that important; the key is to keep the logic and dependencies close by, code-wise.
Hi, based on the ideas from Jan and Mateusz, I sent a fix patch, see
https://lore.kernel.org/linux-fsdevel/20240805013446.814357-1-chengzhihao@huaweicloud.com/T/#u

On 2024/7/12 14:27, Zhihao Cheng wrote:
> Hi. Recently, we found a deadlock in inode reclaiming process caused by
> circular dependence between file inode and file's xattr inode.
>
> Problem description
> ===================
>
> The inode reclaiming process(See function prune_icache_sb) collects all
> reclaimable inodes and marks them with I_FREEING flag at first, at that
> time, other processes will be stuck if they try getting these inodes(See
> function find_inode_fast), then the reclaiming process destroys the
> inodes by function dispose_list().
> Some filesystems(eg. ext4 with ea_inode feature, ubifs with xattr) may
> do inode lookup in the inode evicting callback function, if the inode
> lookup is operated under the inode lru traversing context, deadlock
> problems may happen.
>
diff --git a/fs/inode.c b/fs/inode.c
index 3a41f83a4ba5..c649be22f841 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -843,7 +844,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	 * be under pressure before the cache inside the highmem zone.
 	 */
 	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
-		__iget(inode);
+		inode->i_state |= I_WILL_FREE;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(lru_lock);