Message ID | 20230719221918.8937-4-krisman@suse.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Support negative dentries on case-insensitive ext4 and f2fs | expand |
On Wed, Jul 19, 2023 at 06:19:14PM -0400, Gabriel Krisman Bertazi wrote: > +static inline int generic_ci_d_revalidate(struct dentry *dentry, > + const struct qstr *name, > + unsigned int flags) This shouldn't be 'inline', since it can't be inlined into anywhere anyway. > + if (dir && needs_casefold(dir)) { > + /* > + * Filesystems will call into d_revalidate without > + * setting LOOKUP_ flags even for file creation(see > + * lookup_one* variants). Reject negative dentries in > + * this case, since we can't know for sure it won't be > + * used for creation. > + */ > + if (!flags) > + return 0; > + > + /* > + * Negative dentries created prior to turning the > + * directory case-insensitive cannot be trusted, since > + * they don't ensure any possible case version of the > + * filename doesn't exist. > + */ > + if (!d_is_casefold_lookup(dentry)) > + return 0; > + > + if (flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET)) { > + /* > + * ->d_name won't change from under us in the > + * creation path only, since d_revalidate during > + * creation and renames is always called with > + * the parent inode locked. It isn't the case > + * for all lookup callpaths, so ->d_name must > + * not be touched outside > + * (LOOKUP_CREATE|LOOKUP_RENAME_TARGET) context. > + */ > + if (dentry->d_name.len != name->len || > + memcmp(dentry->d_name.name, name->name, name->len)) > + return 0; > + } > + } This would be easier to follow if the '!flags' and 'flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET)' sections were adjacent to each other. They group together logically, since they both deal with the following problem: "when the lookup might be for creation, invalidate the negative dentry if it might not be a case-sensitive match". (And it would help if there was a comment explaining that problem.) The d_is_casefold_lookup() check solves a different problem. I'm also having trouble understanding exactly when ->d_name is stable here. AFAICS, unfortunately the VFS has an edge case where a dentry can be moved without its parent's ->i_rwsem being held. It happens when a subdirectory is "found" under multiple names. The VFS doesn't support directory hard links, so if it finds a second link to a directory, it just moves the whole dentry tree to the new location. This can happen if a filesystem image is corrupted and contains directory hard links. Coincidentally, it can also happen in an encrypted directory due to the no-key name => normal name transition... But, negative dentries are never moved, at all. (__d_move() even WARNs if you ask it to move a negative dentry.) That feels like that would make everything else irrelevant here, though your patchset doesn't mention this. I suppose the problem would be what if the dentry is made positive concurrently. I'm struggling to find an ideal solution here. Maybe ->d_lock should just be taken for the name comparison. That is legal here, and it reliably synchronizes with the dentry being moved, without having to consider anything else. So, the following is probably what I'd do. I'd be interested to hear your thoughts, though: /* * A negative dentry that was created before the * directory became case-insensitive can't be trusted, * since it doesn't ensure any possible case version of * the filename doesn't exist. */ if (!d_is_casefold_lookup(dentry)) return 0; /* * If the lookup is for creation, then a negative dentry * can only be reused if it's a case-sensitive match, * not just a case-insensitive one. This is required to * provide case-preserving semantics. * * In some cases (lookup_one_*()), the lookup intent is * unknown, resulting in flags==0 here. So we have to * assume that flags==0 is potentially a creation. */ if (flags == 0 || (flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET))) { bool match; spin_lock(&dentry->d_lock); match = (dentry->d_name.len == name->len && memcmp(dentry->d_name.name, name->name, name->len) == 0); spin_unlock(&dentry->d_lock); if (!match) return 0; }
On Wed, Jul 19, 2023 at 11:06:57PM -0700, Eric Biggers wrote: > > I'm also having trouble understanding exactly when ->d_name is stable here. > AFAICS, unfortunately the VFS has an edge case where a dentry can be moved > without its parent's ->i_rwsem being held. It happens when a subdirectory is > "found" under multiple names. The VFS doesn't support directory hard links, so > if it finds a second link to a directory, it just moves the whole dentry tree to > the new location. This can happen if a filesystem image is corrupted and > contains directory hard links. Coincidentally, it can also happen in an > encrypted directory due to the no-key name => normal name transition... Sorry, I think I got this slightly wrong. The move does happen with the parent's ->i_rwsem held, but it's for read, not for write. First, before ->lookup is called, the ->i_rwsem of the parent directory is taken for read. ->lookup() calls d_splice_alias() which can call __d_unalias() which does the __d_move(). If the old alias is in a different directory (which cannot happen in that fscrypt case, but can happen in the general "directory hard links" case), __d_unalias() takes that directory's ->i_rwsem for read too. So it looks like the parent's ->i_rwsem does indeed exclude moves of child dentries, but only if it's taken for *write*. So I guess you can rely on that; it's just a bit more subtle than it first appears. Though, some of your explanation seems to assume that a read lock is sufficient ("In __lookup_slow, either the parent inode is locked by the caller (lookup_slow) ..."), so maybe there is still a problem. - Eric
Eric Biggers <ebiggers@kernel.org> writes: > On Wed, Jul 19, 2023 at 11:06:57PM -0700, Eric Biggers wrote: >> >> I'm also having trouble understanding exactly when ->d_name is stable here. >> AFAICS, unfortunately the VFS has an edge case where a dentry can be moved >> without its parent's ->i_rwsem being held. It happens when a subdirectory is >> "found" under multiple names. The VFS doesn't support directory hard links, so >> if it finds a second link to a directory, it just moves the whole dentry tree to >> the new location. This can happen if a filesystem image is corrupted and >> contains directory hard links. Coincidentally, it can also happen in an >> encrypted directory due to the no-key name => normal name transition... > > Sorry, I think I got this slightly wrong. The move does happen with the > parent's ->i_rwsem held, but it's for read, not for write. First, before > ->lookup is called, the ->i_rwsem of the parent directory is taken for read. > ->lookup() calls d_splice_alias() which can call __d_unalias() which does the > __d_move(). If the old alias is in a different directory (which cannot happen > in that fscrypt case, but can happen in the general "directory hard links" > case), __d_unalias() takes that directory's ->i_rwsem for read too. > > So it looks like the parent's ->i_rwsem does indeed exclude moves of child > dentries, but only if it's taken for *write*. So I guess you can rely on that; > it's just a bit more subtle than it first appears. Though, some of your > explanation seems to assume that a read lock is sufficient ("In __lookup_slow, > either the parent inode is locked by the caller (lookup_slow) ..."), so maybe > there is still a problem. I think I'm missing something on your clarification. I see your point about __d_unalias, and I see in the case where alias->d_parent != dentry->d_parent we acquire the parent inode read lock: static int __d_unalias(struct inode *inode, struct dentry *dentry, struct dentry *alias) { ... m1 = &dentry->d_sb->s_vfs_rename_mutex; if (!inode_trylock_shared(alias->d_parent->d_inode)) goto out_err; } And it seems to use that for __d_move. In this case, __d_move changes from under us even with a read lock, which is dangerous. I think I agree with your first email more than the clarification. In the lookup_slow then: lookup_slow() d_lookup() d_splice_alias() __d_unalias() __d_move() this __d_move Can do a dentry move and race with d_revalidate even though it has the parent read lock. > So it looks like the parent's ->i_rwsem does indeed exclude moves of child > dentries, but only if it's taken for *write*. So I guess you can rely on that; We can get away of it with acquiring the d_lock as you suggested, I think. But can you clarify the above? I wanna make sure I didn't miss anything. I am indeed relying only on the read lock here, as you can see.
On Fri, Jul 21, 2023 at 04:16:30PM -0400, Gabriel Krisman Bertazi wrote: > Eric Biggers <ebiggers@kernel.org> writes: > > > On Wed, Jul 19, 2023 at 11:06:57PM -0700, Eric Biggers wrote: > >> > >> I'm also having trouble understanding exactly when ->d_name is stable here. > >> AFAICS, unfortunately the VFS has an edge case where a dentry can be moved > >> without its parent's ->i_rwsem being held. It happens when a subdirectory is > >> "found" under multiple names. The VFS doesn't support directory hard links, so > >> if it finds a second link to a directory, it just moves the whole dentry tree to > >> the new location. This can happen if a filesystem image is corrupted and > >> contains directory hard links. Coincidentally, it can also happen in an > >> encrypted directory due to the no-key name => normal name transition... > > > > Sorry, I think I got this slightly wrong. The move does happen with the > > parent's ->i_rwsem held, but it's for read, not for write. First, before > > ->lookup is called, the ->i_rwsem of the parent directory is taken for read. > > ->lookup() calls d_splice_alias() which can call __d_unalias() which does the > > __d_move(). If the old alias is in a different directory (which cannot happen > > in that fscrypt case, but can happen in the general "directory hard links" > > case), __d_unalias() takes that directory's ->i_rwsem for read too. > > > > So it looks like the parent's ->i_rwsem does indeed exclude moves of child > > dentries, but only if it's taken for *write*. So I guess you can rely on that; > > it's just a bit more subtle than it first appears. Though, some of your > > explanation seems to assume that a read lock is sufficient ("In __lookup_slow, > > either the parent inode is locked by the caller (lookup_slow) ..."), so maybe > > there is still a problem. > > I think I'm missing something on your clarification. I see your point > about __d_unalias, and I see in the case where alias->d_parent != > dentry->d_parent we acquire the parent inode read lock: > > static int __d_unalias(struct inode *inode, > struct dentry *dentry, struct dentry *alias) > { > ... > m1 = &dentry->d_sb->s_vfs_rename_mutex; > if (!inode_trylock_shared(alias->d_parent->d_inode)) > goto out_err; > } > > And it seems to use that for __d_move. In this case, __d_move changes > from under us even with a read lock, which is dangerous. I think I > agree with your first email more than the clarification. > > In the lookup_slow then: > > lookup_slow() > d_lookup() > d_splice_alias() > __d_unalias() > __d_move() > > this __d_move Can do a dentry move and race with d_revalidate even > though it has the parent read lock. > > > So it looks like the parent's ->i_rwsem does indeed exclude moves of child > > dentries, but only if it's taken for *write*. So I guess you can rely on that; > > We can get away of it with acquiring the d_lock as you suggested, I > think. But can you clarify the above? I wanna make sure I didn't miss > anything. I am indeed relying only on the read lock here, as you can see. In my first email I thought that __d_move() can be called without the parent inode's i_rwsem held at all. In my second email I realized that it is always called with at least a read (shared) lock. The question is what kind of parent i_rwsem lock is guaranteed to be held by the caller of ->d_revalidate() when the name comparison is done. Based on the above, it needs to be at least a write (exclusive) lock to exclude __d_move() without taking d_lock. However, your analysis (in the commit message of "libfs: Validate negative dentries in case-insensitive directories") only talks about i_rwsem being "taken", without saying whether it's for read or write. One thing you mentioned as taking i_rwsem is lookup_slow, but that only takes it for read. That seems like a problem, as it makes your analysis not correct. - Eric
Eric Biggers <ebiggers@kernel.org> writes: > On Fri, Jul 21, 2023 at 04:16:30PM -0400, Gabriel Krisman Bertazi wrote: >> Eric Biggers <ebiggers@kernel.org> writes: >> >> > On Wed, Jul 19, 2023 at 11:06:57PM -0700, Eric Biggers wrote: >> >> >> >> I'm also having trouble understanding exactly when ->d_name is stable here. >> >> AFAICS, unfortunately the VFS has an edge case where a dentry can be moved >> >> without its parent's ->i_rwsem being held. It happens when a subdirectory is >> >> "found" under multiple names. The VFS doesn't support directory hard links, so >> >> if it finds a second link to a directory, it just moves the whole dentry tree to >> >> the new location. This can happen if a filesystem image is corrupted and >> >> contains directory hard links. Coincidentally, it can also happen in an >> >> encrypted directory due to the no-key name => normal name transition... >> > >> > Sorry, I think I got this slightly wrong. The move does happen with the >> > parent's ->i_rwsem held, but it's for read, not for write. First, before >> > ->lookup is called, the ->i_rwsem of the parent directory is taken for read. >> > ->lookup() calls d_splice_alias() which can call __d_unalias() which does the >> > __d_move(). If the old alias is in a different directory (which cannot happen >> > in that fscrypt case, but can happen in the general "directory hard links" >> > case), __d_unalias() takes that directory's ->i_rwsem for read too. >> > >> > So it looks like the parent's ->i_rwsem does indeed exclude moves of child >> > dentries, but only if it's taken for *write*. So I guess you can rely on that; >> > it's just a bit more subtle than it first appears. Though, some of your >> > explanation seems to assume that a read lock is sufficient ("In __lookup_slow, >> > either the parent inode is locked by the caller (lookup_slow) ..."), so maybe >> > there is still a problem. >> >> I think I'm missing something on your clarification. I see your point >> about __d_unalias, and I see in the case where alias->d_parent != >> dentry->d_parent we acquire the parent inode read lock: >> >> static int __d_unalias(struct inode *inode, >> struct dentry *dentry, struct dentry *alias) >> { >> ... >> m1 = &dentry->d_sb->s_vfs_rename_mutex; >> if (!inode_trylock_shared(alias->d_parent->d_inode)) >> goto out_err; >> } >> this __d_move Can do a dentry move and race with d_revalidate even >> though it has the parent read lock. >> >> > So it looks like the parent's ->i_rwsem does indeed exclude moves of child >> > dentries, but only if it's taken for *write*. So I guess you can rely on that; >> >> We can get away of it with acquiring the d_lock as you suggested, I >> think. But can you clarify the above? I wanna make sure I didn't miss >> anything. I am indeed relying only on the read lock here, as you can see. > > In my first email I thought that __d_move() can be called without the parent > inode's i_rwsem held at all. In my second email I realized that it is always > called with at least a read (shared) lock. I see. Thank you. We are on the same page now. I was confused by this part of your second comment: >> > I guess you can rely on that; it's just a bit more subtle than it >> > first appears. Though, some of your explanation seems to assume >> > that a read lock is sufficient ("In __lookup_slow, either the >> > parent inode is locked by the caller (lookup_slow) ..."), ...because I was then failing to see, after learning about the __d_move case, how I could rely on the inode read lock. But, as I now realize, __d_move is not called for negative dentries, so lookup_slow is indeed safe. > The question is what kind of parent i_rwsem lock is guaranteed to be held by the > caller of ->d_revalidate() when the name comparison is done. Based on the > above, it needs to be at least a write (exclusive) lock to exclude __d_move() > without taking d_lock. However, your analysis (in the commit message of "libfs: > Validate negative dentries in case-insensitive directories") only talks about > i_rwsem being "taken", without saying whether it's for read or write. One thing > you mentioned as taking i_rwsem is lookup_slow, but that only takes it for read. > That seems like a problem, as it makes your analysis not correct. My understanding and explanation was that a read lock should be enough at all times, despite the __d_move case. Any time d_revalidate is called for a (LOOKUP_CREATE | LOOKUP_RENAME_TARGET), it holds at least the read lock, preventing concurrent changes to d_name of negative dentries. I will review the places that touch ->d_name again and I will keep the patch as-is and update my explanation to include this case.
diff --git a/fs/libfs.c b/fs/libfs.c index 5b851315eeed..dd213f446427 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -1462,9 +1462,57 @@ static int generic_ci_d_hash(const struct dentry *dentry, struct qstr *str) return 0; } +static inline int generic_ci_d_revalidate(struct dentry *dentry, + const struct qstr *name, + unsigned int flags) +{ + if (d_is_negative(dentry)) { + const struct dentry *parent = READ_ONCE(dentry->d_parent); + const struct inode *dir = READ_ONCE(parent->d_inode); + + if (dir && needs_casefold(dir)) { + /* + * Filesystems will call into d_revalidate without + * setting LOOKUP_ flags even for file creation(see + * lookup_one* variants). Reject negative dentries in + * this case, since we can't know for sure it won't be + * used for creation. + */ + if (!flags) + return 0; + + /* + * Negative dentries created prior to turning the + * directory case-insensitive cannot be trusted, since + * they don't ensure any possible case version of the + * filename doesn't exist. + */ + if (!d_is_casefold_lookup(dentry)) + return 0; + + if (flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET)) { + /* + * ->d_name won't change from under us in the + * creation path only, since d_revalidate during + * creation and renames is always called with + * the parent inode locked. It isn't the case + * for all lookup callpaths, so ->d_name must + * not be touched outside + * (LOOKUP_CREATE|LOOKUP_RENAME_TARGET) context. + */ + if (dentry->d_name.len != name->len || + memcmp(dentry->d_name.name, name->name, name->len)) + return 0; + } + } + } + return 1; +} + static const struct dentry_operations generic_ci_dentry_ops = { .d_hash = generic_ci_d_hash, .d_compare = generic_ci_d_compare, + .d_revalidate_name = generic_ci_d_revalidate, }; #endif