Message ID | 20240406090930.2252838-9-yukuai1@huaweicloud.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fs & block: remove bdev->bd_inode | expand |
On Sat, Apr 06, 2024 at 05:09:12PM +0800, Yu Kuai wrote: > From: Yu Kuai <yukuai3@huawei.com> > > Now that all filesystems stash the bdev file, it's ok to get inode > for the file. Looking at the only user of erofs_buf->inode (erofs_bread())... We use the inode for two things there - block size calculation (to get from block number to position in bytes) and access to page cache. We read in full pages anyway. And frankly, looking at the callers, we really would be better off if we passed position in bytes instead of block number. IOW, it smells like erofs_bread() having wrong type. Look at the callers. With 3 exceptions it's fs/erofs/super.c:135: ptr = erofs_bread(buf, erofs_blknr(sb, *offset), EROFS_KMAP); fs/erofs/super.c:151: ptr = erofs_bread(buf, erofs_blknr(sb, *offset), EROFS_KMAP); fs/erofs/xattr.c:84: it.kaddr = erofs_bread(&it.buf, erofs_blknr(sb, it.pos), EROFS_KMAP); fs/erofs/xattr.c:105: it.kaddr = erofs_bread(&it.buf, erofs_blknr(sb, it.pos), fs/erofs/xattr.c:188: it->kaddr = erofs_bread(&it->buf, erofs_blknr(sb, it->pos), fs/erofs/xattr.c:294: it->kaddr = erofs_bread(&it->buf, erofs_blknr(sb, it->pos), fs/erofs/xattr.c:339: it->kaddr = erofs_bread(&it->buf, erofs_blknr(it->sb, it->pos), fs/erofs/xattr.c:378: it->kaddr = erofs_bread(&it->buf, erofs_blknr(sb, it->pos), fs/erofs/zdata.c:943: src = erofs_bread(&buf, erofs_blknr(sb, pos), EROFS_KMAP); and all of them actually want the return value + erofs_offset(...). IOW, we take a linear position (in bytes). Divide it by block size (from sb). Pass the factor to erofs_bread(), where we multiply that by block size (from inode), see which page will that be in, get that page and return a pointer *into* that page. Then we again divide the same position by block size (from sb) and add the remainder to the pointer returned by erofs_bread(). IOW, it would be much easier to pass the position directly and to hell with block size logics. Three exceptions to that pattern: fs/erofs/data.c:80: return erofs_bread(buf, blkaddr, type); fs/erofs/dir.c:66: de = erofs_bread(&buf, i, EROFS_KMAP); fs/erofs/namei.c:103: de = erofs_bread(&buf, mid, EROFS_KMAP); Those could bloody well multiply the argument by block size; the first one (erofs_read_metabuf()) is also interesting - its callers themselves follow the similar pattern. So it might be worth passing it a position in bytes as well... In any case, all 3 have superblock reference, so they can convert from blocks to bytes conveniently. Which means that erofs_bread() doesn't need to mess with block size considerations at all. IOW, it might make sense to replace erofs_buf->inode with pointer to address space. And use file_mapping() instead of file_inode() in that patch...
On Sun, Apr 07, 2024 at 05:05:31AM +0100, Al Viro wrote: > IOW, it might make sense to replace erofs_buf->inode with > pointer to address space. And use file_mapping() instead of ->f_mapping, that is. > file_inode() in that patch... >
Hi Al, On 2024/4/7 12:05, Al Viro wrote: > On Sat, Apr 06, 2024 at 05:09:12PM +0800, Yu Kuai wrote: >> From: Yu Kuai <yukuai3@huawei.com> >> >> Now that all filesystems stash the bdev file, it's ok to get inode >> for the file. > > Looking at the only user of erofs_buf->inode (erofs_bread())... We > use the inode for two things there - block size calculation (to get > from block number to position in bytes) and access to page cache. > We read in full pages anyway. And frankly, looking at the callers, > we really would be better off if we passed position in bytes instead > of block number. IOW, it smells like erofs_bread() having wrong type. > > Look at the callers. With 3 exceptions it's > fs/erofs/super.c:135: ptr = erofs_bread(buf, erofs_blknr(sb, *offset), EROFS_KMAP); > fs/erofs/super.c:151: ptr = erofs_bread(buf, erofs_blknr(sb, *offset), EROFS_KMAP); > fs/erofs/xattr.c:84: it.kaddr = erofs_bread(&it.buf, erofs_blknr(sb, it.pos), EROFS_KMAP); > fs/erofs/xattr.c:105: it.kaddr = erofs_bread(&it.buf, erofs_blknr(sb, it.pos), > fs/erofs/xattr.c:188: it->kaddr = erofs_bread(&it->buf, erofs_blknr(sb, it->pos), > fs/erofs/xattr.c:294: it->kaddr = erofs_bread(&it->buf, erofs_blknr(sb, it->pos), > fs/erofs/xattr.c:339: it->kaddr = erofs_bread(&it->buf, erofs_blknr(it->sb, it->pos), > fs/erofs/xattr.c:378: it->kaddr = erofs_bread(&it->buf, erofs_blknr(sb, it->pos), > fs/erofs/zdata.c:943: src = erofs_bread(&buf, erofs_blknr(sb, pos), EROFS_KMAP); > > and all of them actually want the return value + erofs_offset(...). IOW, > we take a linear position (in bytes). Divide it by block size (from sb). > Pass the factor to erofs_bread(), where we multiply that by block size > (from inode), see which page will that be in, get that page and return a > pointer *into* that page. Then we again divide the same position > by block size (from sb) and add the remainder to the pointer returned > by erofs_bread(). > > IOW, it would be much easier to pass the position directly and to hell > with block size logics. Three exceptions to that pattern: > > fs/erofs/data.c:80: return erofs_bread(buf, blkaddr, type); > fs/erofs/dir.c:66: de = erofs_bread(&buf, i, EROFS_KMAP); > fs/erofs/namei.c:103: de = erofs_bread(&buf, mid, EROFS_KMAP); > > Those could bloody well multiply the argument by block size; > the first one (erofs_read_metabuf()) is also interesting - its > callers themselves follow the similar pattern. So it might be > worth passing it a position in bytes as well... > > In any case, all 3 have superblock reference, so they can convert > from blocks to bytes conveniently. Which means that erofs_bread() > doesn't need to mess with block size considerations at all. > > IOW, it might make sense to replace erofs_buf->inode with > pointer to address space. And use file_mapping() instead of > file_inode() in that patch... Just saw this again by chance, which is unexpected. Yeah, I think that is a good idea. The story is that erofs_bread() was derived from a page-based interface: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/data.c?h=v5.10#n35 so it was once a page index number. I think a byte offset will be a better interface to clean up these, thanks for your time and work on this! BTW, sightly off the topic: I'm little confused why I'm not be looped for this version this time even: 1) I explicitly asked to Cc the mailing list so that I could find the latest discussion and respond in time: https://lore.kernel.org/r/5e04a86d-8bbd-41da-95f6-cf1562ed04f9@linux.alibaba.com 2) I sent my r-v-b tag on RFC v4 (and the tag was added on this version) but I didn't receive this new version. Thanks, Gao Xiang
Hi, 在 2024/04/12 0:13, Gao Xiang 写道: > Hi Al, > > On 2024/4/7 12:05, Al Viro wrote: >> On Sat, Apr 06, 2024 at 05:09:12PM +0800, Yu Kuai wrote: >>> From: Yu Kuai <yukuai3@huawei.com> >>> >>> Now that all filesystems stash the bdev file, it's ok to get inode >>> for the file. >> >> Looking at the only user of erofs_buf->inode (erofs_bread())... We >> use the inode for two things there - block size calculation (to get >> from block number to position in bytes) and access to page cache. >> We read in full pages anyway. And frankly, looking at the callers, >> we really would be better off if we passed position in bytes instead >> of block number. IOW, it smells like erofs_bread() having wrong type. >> >> Look at the callers. With 3 exceptions it's >> fs/erofs/super.c:135: ptr = erofs_bread(buf, erofs_blknr(sb, >> *offset), EROFS_KMAP); >> fs/erofs/super.c:151: ptr = erofs_bread(buf, erofs_blknr(sb, >> *offset), EROFS_KMAP); >> fs/erofs/xattr.c:84: it.kaddr = erofs_bread(&it.buf, >> erofs_blknr(sb, it.pos), EROFS_KMAP); >> fs/erofs/xattr.c:105: it.kaddr = erofs_bread(&it.buf, >> erofs_blknr(sb, it.pos), >> fs/erofs/xattr.c:188: it->kaddr = erofs_bread(&it->buf, >> erofs_blknr(sb, it->pos), >> fs/erofs/xattr.c:294: it->kaddr = erofs_bread(&it->buf, >> erofs_blknr(sb, it->pos), >> fs/erofs/xattr.c:339: it->kaddr = erofs_bread(&it->buf, >> erofs_blknr(it->sb, it->pos), >> fs/erofs/xattr.c:378: it->kaddr = erofs_bread(&it->buf, >> erofs_blknr(sb, it->pos), >> fs/erofs/zdata.c:943: src = erofs_bread(&buf, >> erofs_blknr(sb, pos), EROFS_KMAP); >> >> and all of them actually want the return value + erofs_offset(...). IOW, >> we take a linear position (in bytes). Divide it by block size (from sb). >> Pass the factor to erofs_bread(), where we multiply that by block size >> (from inode), see which page will that be in, get that page and return a >> pointer *into* that page. Then we again divide the same position >> by block size (from sb) and add the remainder to the pointer returned >> by erofs_bread(). >> >> IOW, it would be much easier to pass the position directly and to hell >> with block size logics. Three exceptions to that pattern: >> >> fs/erofs/data.c:80: return erofs_bread(buf, blkaddr, type); >> fs/erofs/dir.c:66: de = erofs_bread(&buf, i, EROFS_KMAP); >> fs/erofs/namei.c:103: de = erofs_bread(&buf, mid, EROFS_KMAP); >> >> Those could bloody well multiply the argument by block size; >> the first one (erofs_read_metabuf()) is also interesting - its >> callers themselves follow the similar pattern. So it might be >> worth passing it a position in bytes as well... >> >> In any case, all 3 have superblock reference, so they can convert >> from blocks to bytes conveniently. Which means that erofs_bread() >> doesn't need to mess with block size considerations at all. >> >> IOW, it might make sense to replace erofs_buf->inode with >> pointer to address space. And use file_mapping() instead of >> file_inode() in that patch... > > Just saw this again by chance, which is unexpected. > > Yeah, I think that is a good idea. The story is that erofs_bread() > was derived from a page-based interface: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/data.c?h=v5.10#n35 > > > so it was once a page index number. I think a byte offset will be > a better interface to clean up these, thanks for your time and work > on this! > > BTW, sightly off the topic: > > I'm little confused why I'm not be looped for this version this time > even: > > 1) I explicitly asked to Cc the mailing list so that I could find > the latest discussion and respond in time: > > https://lore.kernel.org/r/5e04a86d-8bbd-41da-95f6-cf1562ed04f9@linux.alibaba.com > > > 2) I sent my r-v-b tag on RFC v4 (and the tag was added on this > version) but I didn't receive this new version. This is my fault to blame, I gave up to cc all address from get_maintainer.pl for this set, because I somehow can't send this set with too much CC. However, I should still CC you. Thanks, Kuai > > Thanks, > Gao Xiang > . >
On Fri, Apr 12, 2024 at 12:13:42AM +0800, Gao Xiang wrote: > Hi Al, > > On 2024/4/7 12:05, Al Viro wrote: > > On Sat, Apr 06, 2024 at 05:09:12PM +0800, Yu Kuai wrote: > > > From: Yu Kuai <yukuai3@huawei.com> > > > > > > Now that all filesystems stash the bdev file, it's ok to get inode > > > for the file. > > > > Looking at the only user of erofs_buf->inode (erofs_bread())... We > > use the inode for two things there - block size calculation (to get > > from block number to position in bytes) and access to page cache. > > We read in full pages anyway. And frankly, looking at the callers, > > we really would be better off if we passed position in bytes instead > > of block number. IOW, it smells like erofs_bread() having wrong type. > > > > Look at the callers. With 3 exceptions it's > > fs/erofs/super.c:135: ptr = erofs_bread(buf, erofs_blknr(sb, *offset), EROFS_KMAP); > > fs/erofs/super.c:151: ptr = erofs_bread(buf, erofs_blknr(sb, *offset), EROFS_KMAP); > > fs/erofs/xattr.c:84: it.kaddr = erofs_bread(&it.buf, erofs_blknr(sb, it.pos), EROFS_KMAP); > > fs/erofs/xattr.c:105: it.kaddr = erofs_bread(&it.buf, erofs_blknr(sb, it.pos), > > fs/erofs/xattr.c:188: it->kaddr = erofs_bread(&it->buf, erofs_blknr(sb, it->pos), > > fs/erofs/xattr.c:294: it->kaddr = erofs_bread(&it->buf, erofs_blknr(sb, it->pos), > > fs/erofs/xattr.c:339: it->kaddr = erofs_bread(&it->buf, erofs_blknr(it->sb, it->pos), > > fs/erofs/xattr.c:378: it->kaddr = erofs_bread(&it->buf, erofs_blknr(sb, it->pos), > > fs/erofs/zdata.c:943: src = erofs_bread(&buf, erofs_blknr(sb, pos), EROFS_KMAP); > > > > and all of them actually want the return value + erofs_offset(...). IOW, > > we take a linear position (in bytes). Divide it by block size (from sb). > > Pass the factor to erofs_bread(), where we multiply that by block size > > (from inode), see which page will that be in, get that page and return a > > pointer *into* that page. Then we again divide the same position > > by block size (from sb) and add the remainder to the pointer returned > > by erofs_bread(). > > > > IOW, it would be much easier to pass the position directly and to hell > > with block size logics. Three exceptions to that pattern: > > > > fs/erofs/data.c:80: return erofs_bread(buf, blkaddr, type); > > fs/erofs/dir.c:66: de = erofs_bread(&buf, i, EROFS_KMAP); > > fs/erofs/namei.c:103: de = erofs_bread(&buf, mid, EROFS_KMAP); > > > > Those could bloody well multiply the argument by block size; > > the first one (erofs_read_metabuf()) is also interesting - its > > callers themselves follow the similar pattern. So it might be > > worth passing it a position in bytes as well... > > > > In any case, all 3 have superblock reference, so they can convert > > from blocks to bytes conveniently. Which means that erofs_bread() > > doesn't need to mess with block size considerations at all. > > > > IOW, it might make sense to replace erofs_buf->inode with > > pointer to address space. And use file_mapping() instead of > > file_inode() in that patch... > > Just saw this again by chance, which is unexpected. > > Yeah, I think that is a good idea. The story is that erofs_bread() > was derived from a page-based interface: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/data.c?h=v5.10#n35 > > so it was once a page index number. I think a byte offset will be > a better interface to clean up these, thanks for your time and work > on this! FWIW, see #misc.erofs and #more.erofs in my tree; the former is the minimal conversion of erofs_read_buf() and switch from buf->inode to buf->mapping, the latter follows that up with massage for erofs_read_metabuf(). Completely untested; it builds, but that's all I can promise. Individual patches in followups.
On Thu, Apr 25, 2024 at 08:56:41PM +0100, Al Viro wrote: > FWIW, see #misc.erofs and #more.erofs in my tree; the former is the > minimal conversion of erofs_read_buf() and switch from buf->inode > to buf->mapping, the latter follows that up with massage for > erofs_read_metabuf(). First two and last four patches resp. BTW, what are the intended rules for inline symlinks? "Should fit within the same block as the last byte of on-disk erofs_inode_{compact,extended}"? Feels like erofs_read_inode() might be better off if it did copying the symlink body instead of leaving it to erofs_fill_symlink(), complete with the sanity checks... I'd left that logics alone, though - I'm nowhere near familiar enough with erofs layout.
Hi Al, On 2024/4/26 04:08, Al Viro wrote: > On Thu, Apr 25, 2024 at 08:56:41PM +0100, Al Viro wrote: > >> FWIW, see #misc.erofs and #more.erofs in my tree; the former is the >> minimal conversion of erofs_read_buf() and switch from buf->inode >> to buf->mapping, the latter follows that up with massage for >> erofs_read_metabuf(). > > First two and last four patches resp. BTW, what are the intended rules > for inline symlinks? "Should fit within the same block as the last symlink on-disk layout follows the same rule of regular files. The last logical block can be inlined right after the on-disk inode (called tail packing inline) or use a separate fs block to keep the symlink if tail packing inline doesn't fit. > byte of on-disk erofs_inode_{compact,extended}"? Feels like > erofs_read_inode() might be better off if it did copying the symlink > body instead of leaving it to erofs_fill_symlink(), complete with > the sanity checks... I'd left that logics alone, though - I'm nowhere > near familiar enough with erofs layout. If I understand correctly, do you mean just fold erofs_fill_symlink() into the caller? That is fine with me, I can change this in the future. Thanks, Gao Xiang
On Fri, Apr 26, 2024 at 05:56:52AM +0800, Gao Xiang wrote: > Hi Al, > > On 2024/4/26 04:08, Al Viro wrote: > > On Thu, Apr 25, 2024 at 08:56:41PM +0100, Al Viro wrote: > > > > > FWIW, see #misc.erofs and #more.erofs in my tree; the former is the > > > minimal conversion of erofs_read_buf() and switch from buf->inode > > > to buf->mapping, the latter follows that up with massage for > > > erofs_read_metabuf(). > > > > First two and last four patches resp. BTW, what are the intended rules > > for inline symlinks? "Should fit within the same block as the last > > symlink on-disk layout follows the same rule of regular files. The last > logical block can be inlined right after the on-disk inode (called tail > packing inline) or use a separate fs block to keep the symlink if tail > packing inline doesn't fit. > > > byte of on-disk erofs_inode_{compact,extended}"? Feels like > > erofs_read_inode() might be better off if it did copying the symlink > > body instead of leaving it to erofs_fill_symlink(), complete with > > the sanity checks... I'd left that logics alone, though - I'm nowhere > > near familiar enough with erofs layout. > If I understand correctly, do you mean just fold erofs_fill_symlink() > into the caller? That is fine with me, I can change this in the > future. It's just that the calling conventions of erofs_read_inode() feel wrong ;-/ We return a pointer and offset, with (ERR_PTR(...), anything) used to indicate an error and (pointer into page, offset) used (in case of fast symlinks and only in case of fast symlinks) to encode the address of symlink body, with data starting at pointer + offset + vi->xattr_isize and length being ->i_size, no greater than block size - offset - vi->xattr_size. If anything, it would be easier to follow (and document) if you had allocated and filled the symlink body right there in erofs_read_inode(). That way you could lift erofs_put_metabuf() call into erofs_read_inode(), along with the variable itself. Perhaps something like void *erofs_read_inode(inode) with ERR_PTR(-E...) => error NULL => success, not a fast symlink pointer to string => success, a fast symlink, body allocated and returned to caller. Or, for that matter, have it return an int and stuff the body into ->i_link - it's just that you'd need to set ->i_op there with such approach. Not sure, really. BTW, one comment about erofs_fill_symlink() - it's probably a good idea to use kmemdup_nul() rather than open-coding it (and do that after the block overflow check, obviously).
On 2024/4/26 06:28, Al Viro wrote: > On Fri, Apr 26, 2024 at 05:56:52AM +0800, Gao Xiang wrote: >> Hi Al, >> >> On 2024/4/26 04:08, Al Viro wrote: >>> On Thu, Apr 25, 2024 at 08:56:41PM +0100, Al Viro wrote: >>> >>>> FWIW, see #misc.erofs and #more.erofs in my tree; the former is the >>>> minimal conversion of erofs_read_buf() and switch from buf->inode >>>> to buf->mapping, the latter follows that up with massage for >>>> erofs_read_metabuf(). >>> >>> First two and last four patches resp. BTW, what are the intended rules >>> for inline symlinks? "Should fit within the same block as the last >> >> symlink on-disk layout follows the same rule of regular files. The last >> logical block can be inlined right after the on-disk inode (called tail >> packing inline) or use a separate fs block to keep the symlink if tail >> packing inline doesn't fit. >> >>> byte of on-disk erofs_inode_{compact,extended}"? Feels like >>> erofs_read_inode() might be better off if it did copying the symlink >>> body instead of leaving it to erofs_fill_symlink(), complete with >>> the sanity checks... I'd left that logics alone, though - I'm nowhere >>> near familiar enough with erofs layout. >> If I understand correctly, do you mean just fold erofs_fill_symlink() >> into the caller? That is fine with me, I can change this in the >> future. > > It's just that the calling conventions of erofs_read_inode() feel wrong ;-/ > We return a pointer and offset, with (ERR_PTR(...), anything) used to > indicate an error and (pointer into page, offset) used (in case of > fast symlinks and only in case of fast symlinks) to encode the address > of symlink body, with data starting at pointer + offset + vi->xattr_isize > and length being ->i_size, no greater than block size - offset - vi->xattr_size. > > If anything, it would be easier to follow (and document) if you had > allocated and filled the symlink body right there in erofs_read_inode(). > That way you could lift erofs_put_metabuf() call into erofs_read_inode(), > along with the variable itself. > > Perhaps something like void *erofs_read_inode(inode) with > ERR_PTR(-E...) => error > NULL => success, not a fast symlink > pointer to string => success, a fast symlink, body allocated and returned > to caller. Got it. my original plan was that erofs_read_inode() didn't need to handle inode->i_mode stuffs (IOWs, different type of inodes). But I think symlink i_link cases can be handled specifically in erofs_read_inode(). > > Or, for that matter, have it return an int and stuff the body into ->i_link - > it's just that you'd need to set ->i_op there with such approach. > > Not sure, really. BTW, one comment about erofs_fill_symlink() - it's probably > a good idea to use kmemdup_nul() rather than open-coding it (and do that after > the block overflow check, obviously). Yes, let me handle all of this later. Thanks, Gao Xiang
Hi Al, On 2024/4/26 03:56, Al Viro wrote: > On Fri, Apr 12, 2024 at 12:13:42AM +0800, Gao Xiang wrote: ... >> >> Just saw this again by chance, which is unexpected. >> >> Yeah, I think that is a good idea. The story is that erofs_bread() >> was derived from a page-based interface: >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/data.c?h=v5.10#n35 >> >> so it was once a page index number. I think a byte offset will be >> a better interface to clean up these, thanks for your time and work >> on this! > > FWIW, see #misc.erofs and #more.erofs in my tree; the former is the > minimal conversion of erofs_read_buf() and switch from buf->inode > to buf->mapping, the latter follows that up with massage for > erofs_read_metabuf(). > > Completely untested; it builds, but that's all I can promise. Individual > patches in followups. Thanks for so much time on this, I will review/test/feedback these patches by the end of this week since an internal project for my employer is also ongoing. Thanks, Gao Xiang
diff --git a/fs/erofs/data.c b/fs/erofs/data.c index 52524bd9698b..b0a55b4d8c30 100644 --- a/fs/erofs/data.c +++ b/fs/erofs/data.c @@ -70,7 +70,7 @@ void erofs_init_metabuf(struct erofs_buf *buf, struct super_block *sb) if (erofs_is_fscache_mode(sb)) buf->inode = EROFS_SB(sb)->s_fscache->inode; else - buf->inode = sb->s_bdev->bd_inode; + buf->inode = file_inode(sb->s_bdev_file); } void *erofs_read_metabuf(struct erofs_buf *buf, struct super_block *sb,