
[RFC] iomap: generalize IOMAP_INLINE to cover tail-packing case

Message ID 20190703075502.79782-1-yuchao0@huawei.com (mailing list archive)
State New, archived
Series [RFC] iomap: generalize IOMAP_INLINE to cover tail-packing case

Commit Message

Chao Yu July 3, 2019, 7:55 a.m. UTC
Some filesystems, such as erofs and reiserfs, can pack a file's tail
data into its metadata, e.g.:
IOMAP_MAPPED [0, 8192]
IOMAP_INLINE [8192, 8200]

However, the current IOMAP_INLINE type assumes that:
- the inline data is located in page #0, and
- the inline size equals .i_size
These restrictions prevent erofs from converting to iomap's IOMAP_INLINE,
so this patch relaxes them to make IOMAP_INLINE generic enough to cover
the tail-packing case.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
---
 fs/iomap.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
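
To illustrate the relaxed semantics, a tail-packing filesystem's ->iomap_begin
could then return mappings like the ones above directly. The sketch below is
only illustrative and not part of the patch; the "tailfs" name, its inode
fields and the tailfs_* helpers are invented for the example:

/*
 * Hypothetical ->iomap_begin of a tail-packing filesystem, sketch only.
 * Full blocks are reported as IOMAP_MAPPED; the packed tail becomes an
 * IOMAP_INLINE extent that starts at a non-zero file offset and whose
 * length is smaller than i_size.  All tailfs_* names are made up.
 */
static int tailfs_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
		unsigned flags, struct iomap *iomap)
{
	struct tailfs_inode *ti = TAILFS_I(inode);
	loff_t tail_start = ti->tail_offset;	/* e.g. 8192 in the example */

	iomap->bdev = inode->i_sb->s_bdev;

	if (pos < tail_start) {
		/* ordinary block-mapped data in front of the packed tail */
		iomap->type = IOMAP_MAPPED;
		iomap->offset = 0;
		iomap->length = tail_start;
		iomap->addr = tailfs_data_blkaddr(ti);	/* byte address */
	} else {
		/* tail data lives inside the inode's metadata block */
		iomap->type = IOMAP_INLINE;
		iomap->offset = tail_start;
		iomap->length = i_size_read(inode) - tail_start;
		iomap->addr = IOMAP_NULL_ADDR;
		iomap->inline_data = tailfs_tail_data(ti);
	}
	return 0;
}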

Comments

Chao Yu July 8, 2019, 2:12 a.m. UTC | #1
Ping, any comments on this patch?

On 2019/7/3 15:55, Chao Yu wrote:
> Some filesystems like erofs/reiserfs have the ability to pack tail
> data into metadata, e.g.:
> IOMAP_MAPPED [0, 8192]
> IOMAP_INLINE [8192, 8200]
> 
> However current IOMAP_INLINE type has assumption that:
> - inline data should be locating at page #0.
> - inline size should equal to .i_size
> Those restriction fail to convert to use iomap IOMAP_INLINE in erofs,
> so this patch tries to relieve above limits to make IOMAP_INLINE more
> generic to cover tail-packing case.
> 
> Signed-off-by: Chao Yu <yuchao0@huawei.com>
> ---
>  fs/iomap.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 12654c2e78f8..d1c16b692d31 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -264,13 +264,12 @@ static void
>  iomap_read_inline_data(struct inode *inode, struct page *page,
>  		struct iomap *iomap)
>  {
> -	size_t size = i_size_read(inode);
> +	size_t size = iomap->length;
>  	void *addr;
>  
>  	if (PageUptodate(page))
>  		return;
>  
> -	BUG_ON(page->index);
>  	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
>  
>  	addr = kmap_atomic(page);
> @@ -293,7 +292,6 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  	sector_t sector;
>  
>  	if (iomap->type == IOMAP_INLINE) {
> -		WARN_ON_ONCE(pos);
>  		iomap_read_inline_data(inode, page, iomap);
>  		return PAGE_SIZE;
>  	}
>
Christoph Hellwig July 8, 2019, 4:03 p.m. UTC | #2
On Wed, Jul 03, 2019 at 03:55:02PM +0800, Chao Yu wrote:
> Some filesystems like erofs/reiserfs have the ability to pack tail
> data into metadata, e.g.:
> IOMAP_MAPPED [0, 8192]
> IOMAP_INLINE [8192, 8200]
> 
> However current IOMAP_INLINE type has assumption that:
> - inline data should be locating at page #0.
> - inline size should equal to .i_size
> Those restriction fail to convert to use iomap IOMAP_INLINE in erofs,
> so this patch tries to relieve above limits to make IOMAP_INLINE more
> generic to cover tail-packing case.
> 
> Signed-off-by: Chao Yu <yuchao0@huawei.com>

This looks good to me, but I'd also like to see a review and gfs2
testing from Andreas.
Chao Yu July 9, 2019, 1:52 p.m. UTC | #3
On 2019-7-9 0:03, Christoph Hellwig wrote:
> On Wed, Jul 03, 2019 at 03:55:02PM +0800, Chao Yu wrote:
>> Some filesystems like erofs/reiserfs have the ability to pack tail
>> data into metadata, e.g.:
>> IOMAP_MAPPED [0, 8192]
>> IOMAP_INLINE [8192, 8200]
>>
>> However current IOMAP_INLINE type has assumption that:
>> - inline data should be locating at page #0.
>> - inline size should equal to .i_size
>> Those restriction fail to convert to use iomap IOMAP_INLINE in erofs,
>> so this patch tries to relieve above limits to make IOMAP_INLINE more
>> generic to cover tail-packing case.
>>
>> Signed-off-by: Chao Yu <yuchao0@huawei.com>
> 
> This looks good to me, but I'd also like to see a review and gfs2
> testing from Andreas.

Thanks for your reply. :)

Well, so, Andreas, could you please take a look at this patch and do related
test on gfs2 if you have time?

Thanks,

>
Andreas Grünbacher July 9, 2019, 11:32 p.m. UTC | #4
Hi Chao,

Am Mi., 3. Juli 2019 um 09:55 Uhr schrieb Chao Yu <yuchao0@huawei.com>:
> Some filesystems like erofs/reiserfs have the ability to pack tail
> data into metadata, e.g.:
> IOMAP_MAPPED [0, 8192]
> IOMAP_INLINE [8192, 8200]
>
> However current IOMAP_INLINE type has assumption that:
> - inline data should be locating at page #0.
> - inline size should equal to .i_size
> Those restriction fail to convert to use iomap IOMAP_INLINE in erofs,
> so this patch tries to relieve above limits to make IOMAP_INLINE more
> generic to cover tail-packing case.

this patch suffers from the following problems:

* Fiemap distinguishes between FIEMAP_EXTENT_DATA_INLINE and
FIEMAP_EXTENT_DATA_TAIL. Someone will sooner or later inevitably use
iomap_fiemap on a filesystem with tail packing, so if we don't make
the same distinction in iomap, iomap_fiemap will silently produce the
wrong result. This means that IOMAP_TAIL should become a separate
mapping type.

* IOMAP_INLINE mappings always start at offset 0 and span an entire
page, so they are always page aligned. IOMAP_TAIL mappings only need
to be block aligned. If the block size is smaller than the page size,
a tail page can consist of more than one mapping. So
iomap_read_inline_data needs to be changed to only copy a single block
out of iomap->inline_data, and we need to adjust iomap_write_begin and
iomap_readpage_actor accordingly.

* Functions iomap_read_inline_data, iomap_write_end_inline, and
iomap_dio_inline_actor currently all assume that we are operating on
page 0, and that iomap->inline_data points at the data at offset 0.
That's no longer the case for IOMAP_TAIL mappings. We need to look
only at the offset within the page / block there.

* There are some asserts like the following in the code:

  BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));

Those should probably be tightened to check for block boundaries
instead of page boundaries, i.e. something like:

  BUG_ON(size > i_blocksize(inode) -
offset_in_block(iomap->inline_data, inode));
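
(offset_in_block() is not an existing helper at this point; a macro along the
lines of the one below, which also appears in the follow-up patch later in
this thread, would be needed.)

/* Block-granularity counterpart to offset_in_page(); works for both
 * pointers and file offsets.  Shown for reference only. */
#define offset_in_block(offset, inode) \
	((unsigned long)(offset) & (i_blocksize(inode) - 1))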

> Signed-off-by: Chao Yu <yuchao0@huawei.com>
> ---
>  fs/iomap.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 12654c2e78f8..d1c16b692d31 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -264,13 +264,12 @@ static void
>  iomap_read_inline_data(struct inode *inode, struct page *page,
>                 struct iomap *iomap)
>  {
> -       size_t size = i_size_read(inode);
> +       size_t size = iomap->length;

Function iomap_read_inline_data fills the entire page here, not only
the iomap->length part, so this is wrong. But see my comment above
about changing iomap_read_inline_data to read a single block above as
well.

>         void *addr;
>
>         if (PageUptodate(page))
>                 return;
>
> -       BUG_ON(page->index);
>         BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
>
>         addr = kmap_atomic(page);
> @@ -293,7 +292,6 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>         sector_t sector;
>
>         if (iomap->type == IOMAP_INLINE) {
> -               WARN_ON_ONCE(pos);
>                 iomap_read_inline_data(inode, page, iomap);
>                 return PAGE_SIZE;
>         }

Those last two changes look right to me.

Thanks,
Andreas
Chao Yu July 10, 2019, 10:30 a.m. UTC | #5
Hi Andreas,

Thanks for your review in advance. :)

On 2019/7/10 7:32, Andreas Grünbacher wrote:
> Hi Chao,
> 
> Am Mi., 3. Juli 2019 um 09:55 Uhr schrieb Chao Yu <yuchao0@huawei.com>:
>> Some filesystems like erofs/reiserfs have the ability to pack tail
>> data into metadata, e.g.:
>> IOMAP_MAPPED [0, 8192]
>> IOMAP_INLINE [8192, 8200]
>>
>> However current IOMAP_INLINE type has assumption that:
>> - inline data should be locating at page #0.
>> - inline size should equal to .i_size
>> Those restriction fail to convert to use iomap IOMAP_INLINE in erofs,
>> so this patch tries to relieve above limits to make IOMAP_INLINE more
>> generic to cover tail-packing case.
> 
> this patch suffers from the following problems:
> 
> * Fiemap distinguishes between FIEMAP_EXTENT_DATA_INLINE and
> FIEMAP_EXTENT_DATA_TAIL. Someone will sooner or later inevitably use
> iomap_fiemap on a filesystem with tail packing, so if we don't make
> the same distinction in iomap, iomap_fiemap will silently produce the
> wrong result. This means that IOMAP_TAIL should become a separate
> mapping type.

I'm a little confused; IIUC, IOMAP_TAIL and FIEMAP_EXTENT_DATA_TAIL have
different semantics:

- IOMAP_TAIL:
  we may introduce this flag to cover the tail-packing case, where we merge
small-sized data at the tail of a file into the free space of its metadata.
- FIEMAP_EXTENT_DATA_TAIL:
quoted from Documentation/filesystems/fiemap.txt:
"  This will also set FIEMAP_EXTENT_NOT_ALIGNED
Data is packed into a block with data from other files."
It looks like this flag indicates that data from different files is stored in
one common block.

I'm not familiar with the history of the fiemap flags, but maybe
FIEMAP_EXTENT_DATA_TAIL should be renamed to FIEMAP_EXTENT_DATA_PACKING to
avoid ambiguity. And then FIEMAP_EXTENT_DATA_INLINE would be enough to cover
both the normal inline case and the tail-packing case?
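
For reference, iomap_fiemap() already reports IOMAP_INLINE extents with
FIEMAP_EXTENT_DATA_INLINE; the type-to-flag translation in fs/iomap.c looks
roughly like the sketch below (a paraphrase, not the verbatim source):

/* Paraphrase of the type handling in iomap_to_fiemap(), fs/iomap.c. */
static u32 iomap_type_to_fiemap_flag(u16 type)
{
	switch (type) {
	case IOMAP_DELALLOC:
		return FIEMAP_EXTENT_DELALLOC | FIEMAP_EXTENT_UNKNOWN;
	case IOMAP_UNWRITTEN:
		return FIEMAP_EXTENT_UNWRITTEN;
	case IOMAP_INLINE:
		return FIEMAP_EXTENT_DATA_INLINE;
	default:	/* IOMAP_MAPPED; IOMAP_HOLE extents are skipped */
		return 0;
	}
}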

> 
> * IOMAP_INLINE mappings always start at offset 0 and span an entire
> page, so they are always page aligned. IOMAP_TAIL mappings only need

Why can't page #0 consist of more than one block/mapping? I didn't get what
the difference is here.

> to be block aligned. If the block size is smaller than the page size,

- reiserfs looks at the last page's content; if the size of that page's valid
content is smaller than a threshold (in any case less than one block),
reiserfs can do the packing.

- erofs' block size is 4KB, which is the same as the page size.

Actually, for IOMAP_TAIL there are no users who need to map more than one block
into metadata, so I'm not sure whether we should support that, or whether we
should just prepare the code for such potential users?

Thanks,

> a tail page can consist of more than one mapping. So
> iomap_read_inline_data needs to be changed to only copy a single block
> out of iomap->inline_data, and we need to adjust iomap_write_begin and
> iomap_readpage_actor accordingly.
> 
> * Functions iomap_read_inline_data, iomap_write_end_inline, and
> iomap_dio_inline_actor currently all assume that we are operating on
> page 0, and that iomap->inline_data points at the data at offset 0.
> That's no longer the case for IOMAP_TAIL mappings. We need to look
> only at the offset within the page / block there.
> 
> * There are some asserts like the following int he code:
> 
>   BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
> 
> Those should probably be tightened to check for block boundaries
> instead of page boundaries, i.e. something like:
> 
>   BUG_ON(size > i_blocksize(inode) -
> offset_in_block(iomap->inline_data, inode));
> 
>> Signed-off-by: Chao Yu <yuchao0@huawei.com>
>> ---
>>  fs/iomap.c | 4 +---
>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/fs/iomap.c b/fs/iomap.c
>> index 12654c2e78f8..d1c16b692d31 100644
>> --- a/fs/iomap.c
>> +++ b/fs/iomap.c
>> @@ -264,13 +264,12 @@ static void
>>  iomap_read_inline_data(struct inode *inode, struct page *page,
>>                 struct iomap *iomap)
>>  {
>> -       size_t size = i_size_read(inode);
>> +       size_t size = iomap->length;
> 
> Function iomap_read_inline_data fills the entire page here, not only
> the iomap->length part, so this is wrong. But see my comment above
> about changing iomap_read_inline_data to read a single block above as
> well.
> 
>>         void *addr;
>>
>>         if (PageUptodate(page))
>>                 return;
>>
>> -       BUG_ON(page->index);
>>         BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
>>
>>         addr = kmap_atomic(page);
>> @@ -293,7 +292,6 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>>         sector_t sector;
>>
>>         if (iomap->type == IOMAP_INLINE) {
>> -               WARN_ON_ONCE(pos);
>>                 iomap_read_inline_data(inode, page, iomap);
>>                 return PAGE_SIZE;
>>         }
> 
> Those last two changes look right to me.
> 
> Thanks,
> Andreas
> .
>
Andreas Grünbacher July 10, 2019, 9:50 p.m. UTC | #6
Am Mi., 10. Juli 2019 um 12:31 Uhr schrieb Chao Yu <yuchao0@huawei.com>:
> Hi Andreas,
>
> Thanks for your review in advance. :)
>
> On 2019/7/10 7:32, Andreas Grünbacher wrote:
> > Hi Chao,
> >
> > Am Mi., 3. Juli 2019 um 09:55 Uhr schrieb Chao Yu <yuchao0@huawei.com>:
> >> Some filesystems like erofs/reiserfs have the ability to pack tail
> >> data into metadata, e.g.:
> >> IOMAP_MAPPED [0, 8192]
> >> IOMAP_INLINE [8192, 8200]
> >>
> >> However current IOMAP_INLINE type has assumption that:
> >> - inline data should be locating at page #0.
> >> - inline size should equal to .i_size
> >> Those restriction fail to convert to use iomap IOMAP_INLINE in erofs,
> >> so this patch tries to relieve above limits to make IOMAP_INLINE more
> >> generic to cover tail-packing case.
> >
> > this patch suffers from the following problems:
> >
> > * Fiemap distinguishes between FIEMAP_EXTENT_DATA_INLINE and
> > FIEMAP_EXTENT_DATA_TAIL. Someone will sooner or later inevitably use
> > iomap_fiemap on a filesystem with tail packing, so if we don't make
> > the same distinction in iomap, iomap_fiemap will silently produce the
> > wrong result. This means that IOMAP_TAIL should become a separate
> > mapping type.
>
> I'm a little confused, IIUC, IOMAP_TAIL and FIEMAP_EXTENT_DATA_TAIL are with
> different semantics:
>
> - IOMAP_TAIL:
>   we may introduce this flag to cover tail-packing case, in where we merge
> small-sized data in the tail of file into free space of its metadata.
> - FIEMAP_EXTENT_DATA_TAIL:
> quoted from Documentation/filesystems/fiemap.txt
> "  This will also set FIEMAP_EXTENT_NOT_ALIGNED
> Data is packed into a block with data from other files."
> It looks like this flag indicates that blocks from different files stores into
> one common block.

Well, maybe FIEMAP_EXTENT_DATA_INLINE is indeed the more appropriate
flag in your scenario. In that case, we should be fine on the fiemap
front.

> I'm not familiar with fiemap flags' history, but maybe FIEMAP_EXTENT_DATA_TAIL
> should be better to rename to FIEMAP_EXTENT_DATA_PACKING to avoid ambiguity. And
> then FIEMAP_EXTENT_DATA_INLINE will be enough to cover normal inline case and
> tail-packing case?

Those definitions are user-visible.

> > * IOMAP_INLINE mappings always start at offset 0 and span an entire
> > page, so they are always page aligned. IOMAP_TAIL mappings only need
>
> Why can't #0 page consist of more than one block/mapping? I didn't get what's
> difference here.

Currently, iomap_write_begin will read the inline data into page #0
and mark that page up-to-date. There's a check for that in
iomap_write_end_inline. If a page contains more than one mapping, we
won't be able to mark the entire page up to date anymore; we'd need to
track which parts of the page are up to date and which are not. Iomap
supports two tracking mechanisms, buffer heads and iomap_page, and
we'd need to implement that tracking code for each of those cases.
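
As a rough, untested illustration of that point, a tail/inline mapping that
covers only part of a page would have to mark just its own block range up to
date instead of calling SetPageUptodate(); iomap_read_tail_data below is an
invented name, and it assumes the mapping is block-aligned:

static void iomap_read_tail_data(struct inode *inode, struct page *page,
		struct iomap *iomap)
{
	unsigned int poff = offset_in_page(iomap->offset);
	unsigned int bsize = i_blocksize(inode);
	size_t size = i_size_read(inode) - iomap->offset;
	void *addr;

	addr = kmap_atomic(page);
	memcpy(addr + poff, iomap->inline_data, size);
	memset(addr + poff + size, 0, bsize - size);
	kunmap_atomic(addr);

	/*
	 * Mark only this block up to date; the rest of the page keeps its
	 * per-block state in the iomap_page (the buffer_head path would
	 * need an equivalent set_buffer_uptodate() instead).
	 */
	iomap_set_range_uptodate(page, poff, bsize);
}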

> > to be block aligned. If the block size is smaller than the page size,
>
> - reiserfs tries to find last page's content, if the size of that page's valid
> content is smaller than threshold (at least less than one block), reiserfs can
> do the packing.
>
> - erofs' block size is 4kb which is the same as page size.
>
>
> Actually, for IOMAP_TAIL, there is no users who need to map more than one block
> into metadata, so I'm not sure that we should support that, or we just need to
> let code preparing for that to those potential users?

How about architectures with PAGE_SIZE > 4096? I'm assuming that erofs
doesn't require block size == PAGE_SIZE?

> Thanks,
>
> > a tail page can consist of more than one mapping. So
> > iomap_read_inline_data needs to be changed to only copy a single block
> > out of iomap->inline_data, and we need to adjust iomap_write_begin and
> > iomap_readpage_actor accordingly.
> >
> > * Functions iomap_read_inline_data, iomap_write_end_inline, and
> > iomap_dio_inline_actor currently all assume that we are operating on
> > page 0, and that iomap->inline_data points at the data at offset 0.
> > That's no longer the case for IOMAP_TAIL mappings. We need to look
> > only at the offset within the page / block there.
> >
> > * There are some asserts like the following int he code:
> >
> >   BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
> >
> > Those should probably be tightened to check for block boundaries
> > instead of page boundaries, i.e. something like:
> >
> >   BUG_ON(size > i_blocksize(inode) -
> > offset_in_block(iomap->inline_data, inode));
> >
> >> Signed-off-by: Chao Yu <yuchao0@huawei.com>
> >> ---
> >>  fs/iomap.c | 4 +---
> >>  1 file changed, 1 insertion(+), 3 deletions(-)
> >>
> >> diff --git a/fs/iomap.c b/fs/iomap.c
> >> index 12654c2e78f8..d1c16b692d31 100644
> >> --- a/fs/iomap.c
> >> +++ b/fs/iomap.c
> >> @@ -264,13 +264,12 @@ static void
> >>  iomap_read_inline_data(struct inode *inode, struct page *page,
> >>                 struct iomap *iomap)
> >>  {
> >> -       size_t size = i_size_read(inode);
> >> +       size_t size = iomap->length;
> >
> > Function iomap_read_inline_data fills the entire page here, not only
> > the iomap->length part, so this is wrong. But see my comment above
> > about changing iomap_read_inline_data to read a single block above as
> > well.
> >
> >>         void *addr;
> >>
> >>         if (PageUptodate(page))
> >>                 return;
> >>
> >> -       BUG_ON(page->index);
> >>         BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
> >>
> >>         addr = kmap_atomic(page);
> >> @@ -293,7 +292,6 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> >>         sector_t sector;
> >>
> >>         if (iomap->type == IOMAP_INLINE) {
> >> -               WARN_ON_ONCE(pos);
> >>                 iomap_read_inline_data(inode, page, iomap);
> >>                 return PAGE_SIZE;
> >>         }
> >
> > Those last two changes look right to me.
> >
> > Thanks,
> > Andreas
> > .
> >

At this point, can I ask how important this packing mechanism is to
you? I can see a point in implementing inline files, which help
because there tends to be a large number of very small files. But for
not-so-small files, is saving an extra block really worth the trouble,
especially given how cheap storage has become?

Thanks,
Andreas
Gao Xiang July 10, 2019, 11:42 p.m. UTC | #7
At 2019/7/11 5:50 AM, Andreas Grünbacher wrote:
> At this point, can I ask how important this packing mechanism is to
> you? I can see a point in implementing inline files, which help
> because there tends to be a large number of very small files. But for
> not-so-small files, is saving an extra block really worth the trouble,
> especially given how cheap storage has become?

I would try to answer the above. I think there are several advantages to
using tail-end packing inline:

1) It is more cache-friendly. Considering a file "A" accessed by the user
now or recently, we tend to (1) fetch more data of "A" and (2) keep more
data of "A" around, following an LRU-like assumption, because it is more
likely to be used than the metadata of some other file "X"; this holds
especially for files whose tail-end block is relatively small (less than
a threshold, e.g. < 100B just for example);

2) for directory files, tail-end packing will boost traversal performance;

3) I think tail-end packing is a more generic form of inline data: it saves
I/Os for generic cases, not just storage space.

"is saving an extra block really worth the trouble" -- I don't understand
what exactly the trouble is...


Thanks,
Gao Xiang
Andreas Gruenbacher July 11, 2019, 12:28 p.m. UTC | #8
Something along the lines of the attached, broken patch might work in
the end.

Andreas

---
 fs/buffer.c           | 10 ++++--
 fs/iomap.c            | 74 +++++++++++++++++++++++++++++--------------
 include/linux/iomap.h |  3 ++
 3 files changed, 61 insertions(+), 26 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index e450c55f6434..8d8668e377ab 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1873,8 +1873,8 @@ void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
 EXPORT_SYMBOL(page_zero_new_buffers);
 
 static void
-iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
-		struct iomap *iomap)
+iomap_to_bh(struct inode *inode, struct page *page, sector_t block,
+		struct buffer_head *bh, struct iomap *iomap)
 {
 	loff_t offset = block << inode->i_blkbits;
 
@@ -1924,6 +1924,10 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
 				inode->i_blkbits;
 		set_buffer_mapped(bh);
 		break;
+	case IOMAP_INLINE:
+		__iomap_read_inline_data(inode, page, iomap);
+		set_buffer_uptodate(bh);
+		break;
 	}
 }
 
@@ -1969,7 +1973,7 @@ int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
 				if (err)
 					break;
 			} else {
-				iomap_to_bh(inode, block, bh, iomap);
+				iomap_to_bh(inode, page, block, bh, iomap);
 			}
 
 			if (buffer_new(bh)) {
diff --git a/fs/iomap.c b/fs/iomap.c
index 45aa58e837b5..61188e95def2 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -260,24 +260,47 @@ struct iomap_readpage_ctx {
 	struct list_head	*pages;
 };
 
-static void
-iomap_read_inline_data(struct inode *inode, struct page *page,
+#define offset_in_block(offset, inode) \
+	((unsigned long)(offset) & (i_blocksize(inode) - 1))
+
+static bool
+inline_data_within_block(struct inode *inode, struct iomap *iomap,
+		unsigned int size)
+{
+	unsigned int off = offset_in_block(iomap->inline_data, inode);
+
+	return size <= i_blocksize(inode) - off;
+}
+
+void
+__iomap_read_inline_data(struct inode *inode, struct page *page,
 		struct iomap *iomap)
 {
-	size_t size = i_size_read(inode);
+	size_t size = offset_in_block(i_size_read(inode), inode);
+	unsigned int poff = offset_in_page(iomap->offset);
+	unsigned int bsize = i_blocksize(inode);
 	void *addr;
 
 	if (PageUptodate(page))
 		return;
 
-	BUG_ON(page->index);
-	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
+	BUG_ON(!inline_data_within_block(inode, iomap, size));
 
 	addr = kmap_atomic(page);
-	memcpy(addr, iomap->inline_data, size);
-	memset(addr + size, 0, PAGE_SIZE - size);
+	memcpy(addr + poff, iomap->inline_data, size);
+	memset(addr + poff + size, 0, bsize - size);
 	kunmap_atomic(addr);
-	SetPageUptodate(page);
+}
+
+static void
+iomap_read_inline_data(struct inode *inode, struct page *page,
+		struct iomap *iomap)
+{
+	unsigned int poff = offset_in_page(iomap->offset);
+	unsigned int bsize = i_blocksize(inode);
+
+	__iomap_read_inline_data(inode, page, iomap);
+	iomap_set_range_uptodate(page, poff, bsize);
 }
 
 static loff_t
@@ -292,11 +315,8 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	unsigned poff, plen;
 	sector_t sector;
 
-	if (iomap->type == IOMAP_INLINE) {
-		WARN_ON_ONCE(pos);
+	if (iomap->type == IOMAP_INLINE)
 		iomap_read_inline_data(inode, page, iomap);
-		return PAGE_SIZE;
-	}
 
 	/* zero post-eof blocks as the page may be mapped */
 	iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
@@ -637,6 +657,11 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
 	if (PageUptodate(page))
 		return 0;
 
+	if (iomap->type == IOMAP_INLINE) {
+		iomap_read_inline_data(inode, page, iomap);
+		return 0;
+	}
+
 	do {
 		iomap_adjust_read_range(inode, iop, &block_start,
 				block_end - block_start, &poff, &plen);
@@ -682,9 +707,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		goto out_no_page;
 	}
 
-	if (iomap->type == IOMAP_INLINE)
-		iomap_read_inline_data(inode, page, iomap);
-	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
+	if (iomap->flags & IOMAP_F_BUFFER_HEAD)
 		status = __block_write_begin_int(page, pos, len, NULL, iomap);
 	else
 		status = __iomap_write_begin(inode, pos, len, page, iomap);
@@ -761,11 +784,11 @@ iomap_write_end_inline(struct inode *inode, struct page *page,
 {
 	void *addr;
 
-	WARN_ON_ONCE(!PageUptodate(page));
-	BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
+	BUG_ON(!inline_data_within_block(inode, iomap, pos + copied));
 
 	addr = kmap_atomic(page);
-	memcpy(iomap->inline_data + pos, addr + pos, copied);
+	memcpy(iomap->inline_data + offset_in_block(pos, inode),
+	       addr + offset_in_page(pos), copied);
 	kunmap_atomic(addr);
 
 	mark_inode_dirty(inode);
@@ -1064,7 +1087,7 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 		const struct iomap_ops *ops)
 {
 	unsigned int blocksize = i_blocksize(inode);
-	unsigned int off = pos & (blocksize - 1);
+	unsigned int off = offset_in_block(pos, inode);
 
 	/* Block boundary? Nothing to do */
 	if (!off)
@@ -1772,21 +1795,26 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct iov_iter *iter = dio->submit.iter;
 	size_t copied;
 
-	BUG_ON(pos + length > PAGE_SIZE - offset_in_page(iomap->inline_data));
+	BUG_ON(!inline_data_within_block(inode, iomap, pos + length));
 
 	if (dio->flags & IOMAP_DIO_WRITE) {
 		loff_t size = inode->i_size;
 
 		if (pos > size)
-			memset(iomap->inline_data + size, 0, pos - size);
-		copied = copy_from_iter(iomap->inline_data + pos, length, iter);
+			memset(iomap->inline_data +
+			       offset_in_block(size, inode), 0, pos - size);
+		copied = copy_from_iter(iomap->inline_data +
+					offset_in_block(pos, inode),
+					length, iter);
 		if (copied) {
 			if (pos + copied > size)
 				i_size_write(inode, pos + copied);
 			mark_inode_dirty(inode);
 		}
 	} else {
-		copied = copy_to_iter(iomap->inline_data + pos, length, iter);
+		copied = copy_to_iter(iomap->inline_data +
+				      offset_in_block(pos, inode),
+				      length, iter);
 	}
 	dio->size += copied;
 	return copied;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 2103b94cb1bf..a8a60dd2fdc0 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -131,6 +131,9 @@ static inline struct iomap_page *to_iomap_page(struct page *page)
 	return NULL;
 }
 
+void __iomap_read_inline_data(struct inode *inode, struct page *page,
+		struct iomap *iomap);
+
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
 int iomap_readpage(struct page *page, const struct iomap_ops *ops);
Matthew Wilcox July 11, 2019, 1:06 p.m. UTC | #9
On Thu, Jul 11, 2019 at 07:42:20AM +0800, Gao Xiang wrote:
> 
> At 2019/7/11 5:50 AM, Andreas Grünbacher wrote:
> > At this point, can I ask how important this packing mechanism is to
> > you? I can see a point in implementing inline files, which help
> > because there tends to be a large number of very small files. But for
> > not-so-small files, is saving an extra block really worth the trouble,
> > especially given how cheap storage has become?
> 
> I would try to answer the above. I think there are several advantages to
> using tail-end packing inline:
> 
> 1) It is more cache-friendly. Considering a file "A" accessed by the user
> now or recently, we tend to (1) fetch more data of "A" and (2) keep more
> data of "A" around, following an LRU-like assumption, because it is more
> likely to be used than the metadata of some other file "X"; this holds
> especially for files whose tail-end block is relatively small (less than
> a threshold, e.g. < 100B just for example);
> 
> 2) for directory files, tail-end packing will boost traversal performance;
> 
> 3) I think tail-end packing is a more generic form of inline data: it saves
> I/Os for generic cases, not just storage space.
> 
> "is saving an extra block really worth the trouble" -- I don't understand
> what exactly the trouble is...

"the trouble" is adding code complexity and additional things to test.

I'm not sure you really understood Andreas' question.  He's saying that he
understands the performance and space gain from packing short files
(eg files less than 100 bytes).  But how many files are there between
4096 and 4196 bytes in size, let alone between 8192 and 8292, 12384 and
12484 ...

Is optimising for _those_ files worth it?
Gao Xiang July 11, 2019, 1:54 p.m. UTC | #10
On 2019/7/11 21:06, Matthew Wilcox wrote:
> On Thu, Jul 11, 2019 at 07:42:20AM +0800, Gao Xiang wrote:
>>
>> At 2019/7/11 5:50 AM, Andreas Grünbacher wrote:
>>> At this point, can I ask how important this packing mechanism is to
>>> you? I can see a point in implementing inline files, which help
>>> because there tends to be a large number of very small files. But for
>>> not-so-small files, is saving an extra block really worth the trouble,
>>> especially given how cheap storage has become?
>>
>> I would try to answer the above. I think there are several advantages to
>> using tail-end packing inline:
>>
>> 1) It is more cache-friendly. Considering a file "A" accessed by the user
>> now or recently, we tend to (1) fetch more data of "A" and (2) keep more
>> data of "A" around, following an LRU-like assumption, because it is more
>> likely to be used than the metadata of some other file "X"; this holds
>> especially for files whose tail-end block is relatively small (less than
>> a threshold, e.g. < 100B just for example);
>>
>> 2) for directory files, tail-end packing will boost traversal performance;
>>
>> 3) I think tail-end packing is a more generic form of inline data: it saves
>> I/Os for generic cases, not just storage space.
>>
>> "is saving an extra block really worth the trouble" -- I don't understand
>> what exactly the trouble is...
> 
> "the trouble" is adding code complexity and additional things to test.
> 
> I'm not sure you really understood Andreas' question.  He's saying that he
> understands the performance and space gain from packing short files
> (eg files less than 100 bytes).  But how many files are there between
> 4096 and 4196 bytes in size, let alone between 8192 and 8292, 12384 and
> 12484 ...
> 
> Is optimising for _those_ files worth it?

Hi Willy,

Thanks for your kind explanation.. I get it :) Let me try to express my
thoughts on the following points...

1) Personally I think Chao's first patch, which adds an additional type,
   could be better for now; maybe we can reduce duplicate code based on that
   patch even further. What EROFS needs is only read-only tail-end packing.
   For writes I think we actually need to rethink more carefully (which
   doesn't necessarily mean it is complex, but I haven't looked into it, so
   I have to stay silent...), and maybe we could leave that until a real fs
   user switches to iomap and mixes INLINE and TAIL at that time...

2) EROFS actually has an unfinished feature which supports tail-end packing
   of compressed data, which means the decompressed data may not be so
   small... and I know that is another matter... So, to answer the question
   directly: it depends on the user data and the user. For EROFS, tail-end
   packing inline is easy to implement, it's a per-file optional feature
   (not mandatory), and the threshold (< 100B) is not a hardcoded limit
   either; it is configured by mkfs users and only helps mkfs decide whether
   a file should enable it or not. It should be useful for all directories
   at least, and I think it is more cache-friendly for regular files as
   well, so a large range of files configured by users (not just < 100B)
   can benefit from this...

Sorry about my English...

Thanks,
Gao Xiang

>
Chao Yu July 11, 2019, 2:15 p.m. UTC | #11
On 2019-7-11 5:50, Andreas Grünbacher wrote:
> Am Mi., 10. Juli 2019 um 12:31 Uhr schrieb Chao Yu <yuchao0@huawei.com>:
>> Hi Andreas,
>>
>> Thanks for your review in advance. :)
>>
>> On 2019/7/10 7:32, Andreas Grünbacher wrote:
>>> Hi Chao,
>>>
>>> Am Mi., 3. Juli 2019 um 09:55 Uhr schrieb Chao Yu <yuchao0@huawei.com>:
>>>> Some filesystems like erofs/reiserfs have the ability to pack tail
>>>> data into metadata, e.g.:
>>>> IOMAP_MAPPED [0, 8192]
>>>> IOMAP_INLINE [8192, 8200]
>>>>
>>>> However current IOMAP_INLINE type has assumption that:
>>>> - inline data should be locating at page #0.
>>>> - inline size should equal to .i_size
>>>> Those restriction fail to convert to use iomap IOMAP_INLINE in erofs,
>>>> so this patch tries to relieve above limits to make IOMAP_INLINE more
>>>> generic to cover tail-packing case.
>>>
>>> this patch suffers from the following problems:
>>>
>>> * Fiemap distinguishes between FIEMAP_EXTENT_DATA_INLINE and
>>> FIEMAP_EXTENT_DATA_TAIL. Someone will sooner or later inevitably use
>>> iomap_fiemap on a filesystem with tail packing, so if we don't make
>>> the same distinction in iomap, iomap_fiemap will silently produce the
>>> wrong result. This means that IOMAP_TAIL should become a separate
>>> mapping type.
>>
>> I'm a little confused, IIUC, IOMAP_TAIL and FIEMAP_EXTENT_DATA_TAIL are with
>> different semantics:
>>
>> - IOMAP_TAIL:
>>   we may introduce this flag to cover tail-packing case, in where we merge
>> small-sized data in the tail of file into free space of its metadata.
>> - FIEMAP_EXTENT_DATA_TAIL:
>> quoted from Documentation/filesystems/fiemap.txt
>> "  This will also set FIEMAP_EXTENT_NOT_ALIGNED
>> Data is packed into a block with data from other files."
>> It looks like this flag indicates that blocks from different files stores into
>> one common block.
> 
> Well, maybe FIEMAP_EXTENT_DATA_INLINE is indeed the more appropriate
> flag in your scenario. In that case, we should be fine on the fiemap
> front.

Yup.

> 
>> I'm not familiar with fiemap flags' history, but maybe FIEMAP_EXTENT_DATA_TAIL
>> should be better to rename to FIEMAP_EXTENT_DATA_PACKING to avoid ambiguity. And
>> then FIEMAP_EXTENT_DATA_INLINE will be enough to cover normal inline case and
>> tail-packing case?
> 
> Those definitions are user-visible.
> 
>>> * IOMAP_INLINE mappings always start at offset 0 and span an entire
>>> page, so they are always page aligned. IOMAP_TAIL mappings only need
>>
>> Why can't #0 page consist of more than one block/mapping? I didn't get what's
>> difference here.
> 
> Currently, iomap_write_begin will read the inline data into page #0
> and mark that page up-to-date. There's a check for that in
> iomap_write_end_inline. If a page contains more than one mapping, we
> won't be able to mark the entire page up to date anymore; we'd need to
> track which parts of the page are up to date and which are not. Iomap
> supports two tracking mechanisms, buffer heads and iomap_page, and
> we'd need to implement that tracking code for each of those cases.

Okay, I understand now: the limitation (the BUG_ON/WARN_ON checks in the
iomap inline functions) means an inline page can only contain one mapping,
so to generalize it we should consider the unaligned case where the block
size is smaller than the page size.

> 
>>> to be block aligned. If the block size is smaller than the page size,
>>
>> - reiserfs tries to find last page's content, if the size of that page's valid
>> content is smaller than threshold (at least less than one block), reiserfs can
>> do the packing.
>>
>> - erofs' block size is 4kb which is the same as page size.
>>
>>
>> Actually, for IOMAP_TAIL, there is no users who need to map more than one block
>> into metadata, so I'm not sure that we should support that, or we just need to
>> let code preparing for that to those potential users?
> 
> How about architectures with PAGE_SIZE > 4096? I'm assuming that erofs

We haven't considered the PAGE_SIZE > 4096 case yet.

> doesn't require block size == PAGE_SIZE?

At least for now, the erofs block size is 4KB by default.

> 
>> Thanks,
>>
>>> a tail page can consist of more than one mapping. So
>>> iomap_read_inline_data needs to be changed to only copy a single block
>>> out of iomap->inline_data, and we need to adjust iomap_write_begin and
>>> iomap_readpage_actor accordingly.
>>>
>>> * Functions iomap_read_inline_data, iomap_write_end_inline, and
>>> iomap_dio_inline_actor currently all assume that we are operating on
>>> page 0, and that iomap->inline_data points at the data at offset 0.
>>> That's no longer the case for IOMAP_TAIL mappings. We need to look
>>> only at the offset within the page / block there.
>>>
>>> * There are some asserts like the following int he code:
>>>
>>>   BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
>>>
>>> Those should probably be tightened to check for block boundaries
>>> instead of page boundaries, i.e. something like:
>>>
>>>   BUG_ON(size > i_blocksize(inode) -
>>> offset_in_block(iomap->inline_data, inode));
>>>
>>>> Signed-off-by: Chao Yu <yuchao0@huawei.com>
>>>> ---
>>>>  fs/iomap.c | 4 +---
>>>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>>>
>>>> diff --git a/fs/iomap.c b/fs/iomap.c
>>>> index 12654c2e78f8..d1c16b692d31 100644
>>>> --- a/fs/iomap.c
>>>> +++ b/fs/iomap.c
>>>> @@ -264,13 +264,12 @@ static void
>>>>  iomap_read_inline_data(struct inode *inode, struct page *page,
>>>>                 struct iomap *iomap)
>>>>  {
>>>> -       size_t size = i_size_read(inode);
>>>> +       size_t size = iomap->length;
>>>
>>> Function iomap_read_inline_data fills the entire page here, not only
>>> the iomap->length part, so this is wrong. But see my comment above
>>> about changing iomap_read_inline_data to read a single block above as
>>> well.
>>>
>>>>         void *addr;
>>>>
>>>>         if (PageUptodate(page))
>>>>                 return;
>>>>
>>>> -       BUG_ON(page->index);
>>>>         BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
>>>>
>>>>         addr = kmap_atomic(page);
>>>> @@ -293,7 +292,6 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>>>>         sector_t sector;
>>>>
>>>>         if (iomap->type == IOMAP_INLINE) {
>>>> -               WARN_ON_ONCE(pos);
>>>>                 iomap_read_inline_data(inode, page, iomap);
>>>>                 return PAGE_SIZE;
>>>>         }
>>>
>>> Those last two changes look right to me.
>>>
>>> Thanks,
>>> Andreas
>>> .
>>>
> 
> At this point, can I ask how important this packing mechanism is to
> you? I can see a point in implementing inline files, which help
> because there tends to be a large number of very small files. But for
> not-so-small files, is saving an extra block really worth the trouble,
> especially given how cheap storage has become?

- Erofs is a read-only filesystem, so we don't need to consider/design a write
path for the tail-packing feature, including the writeback and inline
conversion cases. So the code complexity doesn't trouble us.

- Files in real user scenarios are usually not aligned to the page or block
size, so we can expect that a good portion of them can have their tail data
packed; and erofs is now used in consumer electronics (cell phones), where
storage is not that large, so we think tail-packing can be one of the
compression/compaction techniques in erofs to save as much storage space as
possible.

Thanks,

> 
> Thanks,
> Andreas
>
Chao Yu July 12, 2019, 9:31 a.m. UTC | #12
On 2019/7/11 20:28, Andreas Gruenbacher wrote:
> Something along the lines of the attached, broken patch might work in
> the end.
> 
> Andreas
> 
> ---
>  fs/buffer.c           | 10 ++++--
>  fs/iomap.c            | 74 +++++++++++++++++++++++++++++--------------
>  include/linux/iomap.h |  3 ++
>  3 files changed, 61 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index e450c55f6434..8d8668e377ab 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1873,8 +1873,8 @@ void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
>  EXPORT_SYMBOL(page_zero_new_buffers);
>  
>  static void
> -iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
> -		struct iomap *iomap)
> +iomap_to_bh(struct inode *inode, struct page *page, sector_t block,
> +		struct buffer_head *bh, struct iomap *iomap)
>  {
>  	loff_t offset = block << inode->i_blkbits;
>  
> @@ -1924,6 +1924,10 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
>  				inode->i_blkbits;
>  		set_buffer_mapped(bh);
>  		break;
> +	case IOMAP_INLINE:
> +		__iomap_read_inline_data(inode, page, iomap);
> +		set_buffer_uptodate(bh);
> +		break;
>  	}
>  }
>  
> @@ -1969,7 +1973,7 @@ int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
>  				if (err)
>  					break;
>  			} else {
> -				iomap_to_bh(inode, block, bh, iomap);
> +				iomap_to_bh(inode, page, block, bh, iomap);
>  			}
>  
>  			if (buffer_new(bh)) {
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 45aa58e837b5..61188e95def2 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -260,24 +260,47 @@ struct iomap_readpage_ctx {
>  	struct list_head	*pages;
>  };
>  
> -static void
> -iomap_read_inline_data(struct inode *inode, struct page *page,
> +#define offset_in_block(offset, inode) \
> +	((unsigned long)(offset) & (i_blocksize(inode) - 1))
> +
> +static bool
> +inline_data_within_block(struct inode *inode, struct iomap *iomap,
> +		unsigned int size)
> +{
> +	unsigned int off = offset_in_block(iomap->inline_data, inode);
> +
> +	return size <= i_blocksize(inode) - off;
> +}
> +
> +void
> +__iomap_read_inline_data(struct inode *inode, struct page *page,
>  		struct iomap *iomap)
>  {
> -	size_t size = i_size_read(inode);
> +	size_t size = offset_in_block(i_size_read(inode), inode);
> +	unsigned int poff = offset_in_page(iomap->offset);
> +	unsigned int bsize = i_blocksize(inode);
>  	void *addr;
>  
>  	if (PageUptodate(page))
>  		return;
>  
> -	BUG_ON(page->index);
> -	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
> +	BUG_ON(!inline_data_within_block(inode, iomap, size));
>  
>  	addr = kmap_atomic(page);
> -	memcpy(addr, iomap->inline_data, size);
> -	memset(addr + size, 0, PAGE_SIZE - size);
> +	memcpy(addr + poff, iomap->inline_data, size);
> +	memset(addr + poff + size, 0, bsize - size);
>  	kunmap_atomic(addr);
> -	SetPageUptodate(page);
> +}
> +
> +static void
> +iomap_read_inline_data(struct inode *inode, struct page *page,
> +		struct iomap *iomap)
> +{
> +	unsigned int poff = offset_in_page(iomap->offset);
> +	unsigned int bsize = i_blocksize(inode);
> +
> +	__iomap_read_inline_data(inode, page, iomap);
> +	iomap_set_range_uptodate(page, poff, bsize);
>  }
>  
>  static loff_t
> @@ -292,11 +315,8 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  	unsigned poff, plen;
>  	sector_t sector;
>  
> -	if (iomap->type == IOMAP_INLINE) {
> -		WARN_ON_ONCE(pos);
> +	if (iomap->type == IOMAP_INLINE)
>  		iomap_read_inline_data(inode, page, iomap);
> -		return PAGE_SIZE;

Hi Andreas,

Thanks for your patch.

In my erofs test case, the filled inline data gets zeroed out because we fall
back to the following flow:

	if (iomap->type != IOMAP_MAPPED || pos >= i_size_read(inode)) {
		zero_user(page, poff, plen);

Should we return before this condition check?

Thanks,

> -	}
>  
>  	/* zero post-eof blocks as the page may be mapped */
>  	iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
> @@ -637,6 +657,11 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
>  	if (PageUptodate(page))
>  		return 0;
>  
> +	if (iomap->type == IOMAP_INLINE) {
> +		iomap_read_inline_data(inode, page, iomap);
> +		return 0;
> +	}
> +
>  	do {
>  		iomap_adjust_read_range(inode, iop, &block_start,
>  				block_end - block_start, &poff, &plen);
> @@ -682,9 +707,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  		goto out_no_page;
>  	}
>  
> -	if (iomap->type == IOMAP_INLINE)
> -		iomap_read_inline_data(inode, page, iomap);
> -	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
> +	if (iomap->flags & IOMAP_F_BUFFER_HEAD)
>  		status = __block_write_begin_int(page, pos, len, NULL, iomap);
>  	else
>  		status = __iomap_write_begin(inode, pos, len, page, iomap);
> @@ -761,11 +784,11 @@ iomap_write_end_inline(struct inode *inode, struct page *page,
>  {
>  	void *addr;
>  
> -	WARN_ON_ONCE(!PageUptodate(page));
> -	BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
> +	BUG_ON(!inline_data_within_block(inode, iomap, pos + copied));
>  
>  	addr = kmap_atomic(page);
> -	memcpy(iomap->inline_data + pos, addr + pos, copied);
> +	memcpy(iomap->inline_data + offset_in_block(pos, inode),
> +	       addr + offset_in_page(pos), copied);
>  	kunmap_atomic(addr);
>  
>  	mark_inode_dirty(inode);
> @@ -1064,7 +1087,7 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
>  		const struct iomap_ops *ops)
>  {
>  	unsigned int blocksize = i_blocksize(inode);
> -	unsigned int off = pos & (blocksize - 1);
> +	unsigned int off = offset_in_block(pos, inode);
>  
>  	/* Block boundary? Nothing to do */
>  	if (!off)
> @@ -1772,21 +1795,26 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
>  	struct iov_iter *iter = dio->submit.iter;
>  	size_t copied;
>  
> -	BUG_ON(pos + length > PAGE_SIZE - offset_in_page(iomap->inline_data));
> +	BUG_ON(!inline_data_within_block(inode, iomap, pos + length));
>  
>  	if (dio->flags & IOMAP_DIO_WRITE) {
>  		loff_t size = inode->i_size;
>  
>  		if (pos > size)
> -			memset(iomap->inline_data + size, 0, pos - size);
> -		copied = copy_from_iter(iomap->inline_data + pos, length, iter);
> +			memset(iomap->inline_data +
> +			       offset_in_block(size, inode), 0, pos - size);
> +		copied = copy_from_iter(iomap->inline_data +
> +					offset_in_block(pos, inode),
> +					length, iter);
>  		if (copied) {
>  			if (pos + copied > size)
>  				i_size_write(inode, pos + copied);
>  			mark_inode_dirty(inode);
>  		}
>  	} else {
> -		copied = copy_to_iter(iomap->inline_data + pos, length, iter);
> +		copied = copy_to_iter(iomap->inline_data +
> +				      offset_in_block(pos, inode),
> +				      length, iter);
>  	}
>  	dio->size += copied;
>  	return copied;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 2103b94cb1bf..a8a60dd2fdc0 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -131,6 +131,9 @@ static inline struct iomap_page *to_iomap_page(struct page *page)
>  	return NULL;
>  }
>  
> +void __iomap_read_inline_data(struct inode *inode, struct page *page,
> +		struct iomap *iomap);
> +
>  ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
>  		const struct iomap_ops *ops);
>  int iomap_readpage(struct page *page, const struct iomap_ops *ops);
>
Andreas Gruenbacher July 12, 2019, 11:54 a.m. UTC | #13
On Fri, 12 Jul 2019 at 11:31, Chao Yu <yuchao0@huawei.com> wrote:
> On 2019/7/11 20:28, Andreas Gruenbacher wrote:
> > Something along the lines of the attached, broken patch might work in
> > the end.
> >
> > Andreas
> >
> > ---
> >  fs/buffer.c           | 10 ++++--
> >  fs/iomap.c            | 74 +++++++++++++++++++++++++++++--------------
> >  include/linux/iomap.h |  3 ++
> >  3 files changed, 61 insertions(+), 26 deletions(-)
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index e450c55f6434..8d8668e377ab 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -1873,8 +1873,8 @@ void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
> >  EXPORT_SYMBOL(page_zero_new_buffers);
> >
> >  static void
> > -iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
> > -             struct iomap *iomap)
> > +iomap_to_bh(struct inode *inode, struct page *page, sector_t block,
> > +             struct buffer_head *bh, struct iomap *iomap)
> >  {
> >       loff_t offset = block << inode->i_blkbits;
> >
> > @@ -1924,6 +1924,10 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
> >                               inode->i_blkbits;
> >               set_buffer_mapped(bh);
> >               break;
> > +     case IOMAP_INLINE:
> > +             __iomap_read_inline_data(inode, page, iomap);
> > +             set_buffer_uptodate(bh);
> > +             break;
> >       }
> >  }
> >
> > @@ -1969,7 +1973,7 @@ int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
> >                               if (err)
> >                                       break;
> >                       } else {
> > -                             iomap_to_bh(inode, block, bh, iomap);
> > +                             iomap_to_bh(inode, page, block, bh, iomap);
> >                       }
> >
> >                       if (buffer_new(bh)) {
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index 45aa58e837b5..61188e95def2 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -260,24 +260,47 @@ struct iomap_readpage_ctx {
> >       struct list_head        *pages;
> >  };
> >
> > -static void
> > -iomap_read_inline_data(struct inode *inode, struct page *page,
> > +#define offset_in_block(offset, inode) \
> > +     ((unsigned long)(offset) & (i_blocksize(inode) - 1))
> > +
> > +static bool
> > +inline_data_within_block(struct inode *inode, struct iomap *iomap,
> > +             unsigned int size)
> > +{
> > +     unsigned int off = offset_in_block(iomap->inline_data, inode);
> > +
> > +     return size <= i_blocksize(inode) - off;
> > +}
> > +
> > +void
> > +__iomap_read_inline_data(struct inode *inode, struct page *page,
> >               struct iomap *iomap)
> >  {
> > -     size_t size = i_size_read(inode);
> > +     size_t size = offset_in_block(i_size_read(inode), inode);
> > +     unsigned int poff = offset_in_page(iomap->offset);
> > +     unsigned int bsize = i_blocksize(inode);
> >       void *addr;
> >
> >       if (PageUptodate(page))
> >               return;
> >
> > -     BUG_ON(page->index);
> > -     BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
> > +     BUG_ON(!inline_data_within_block(inode, iomap, size));
> >
> >       addr = kmap_atomic(page);
> > -     memcpy(addr, iomap->inline_data, size);
> > -     memset(addr + size, 0, PAGE_SIZE - size);
> > +     memcpy(addr + poff, iomap->inline_data, size);
> > +     memset(addr + poff + size, 0, bsize - size);
> >       kunmap_atomic(addr);
> > -     SetPageUptodate(page);
> > +}
> > +
> > +static void
> > +iomap_read_inline_data(struct inode *inode, struct page *page,
> > +             struct iomap *iomap)
> > +{
> > +     unsigned int poff = offset_in_page(iomap->offset);
> > +     unsigned int bsize = i_blocksize(inode);
> > +
> > +     __iomap_read_inline_data(inode, page, iomap);
> > +     iomap_set_range_uptodate(page, poff, bsize);
> >  }
> >
> >  static loff_t
> > @@ -292,11 +315,8 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> >       unsigned poff, plen;
> >       sector_t sector;
> >
> > -     if (iomap->type == IOMAP_INLINE) {
> > -             WARN_ON_ONCE(pos);
> > +     if (iomap->type == IOMAP_INLINE)
> >               iomap_read_inline_data(inode, page, iomap);
> > -             return PAGE_SIZE;
>
> Hi Andreas,
>
> Thanks for your patch.
>
> In my erofs test case, filled inline data will be zeroed out due to we fallback
> to following flow:
>
>         if (iomap->type != IOMAP_MAPPED || pos >= i_size_read(inode)) {
>                 zero_user(page, poff, plen);
>
> Should we return before this condition check?

Yes, probably by returning i_blocksize(inode) after
iomap_read_inline_data, but that alone isn't enough to make the patch
work completely. This really needs a review from Christoph and careful
testing of all the code paths.
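
For illustration only, that early return might look like this (untested):

	if (iomap->type == IOMAP_INLINE) {
		iomap_read_inline_data(inode, page, iomap);
		/* consume one block so the post-EOF zeroing below is skipped */
		return i_blocksize(inode);
	}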

Andreas

> Thanks,
>
> > -     }
> >
> >       /* zero post-eof blocks as the page may be mapped */
> >       iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
> > @@ -637,6 +657,11 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
> >       if (PageUptodate(page))
> >               return 0;
> >
> > +     if (iomap->type == IOMAP_INLINE) {
> > +             iomap_read_inline_data(inode, page, iomap);
> > +             return 0;
> > +     }
> > +
> >       do {
> >               iomap_adjust_read_range(inode, iop, &block_start,
> >                               block_end - block_start, &poff, &plen);
> > @@ -682,9 +707,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
> >               goto out_no_page;
> >       }
> >
> > -     if (iomap->type == IOMAP_INLINE)
> > -             iomap_read_inline_data(inode, page, iomap);
> > -     else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
> > +     if (iomap->flags & IOMAP_F_BUFFER_HEAD)
> >               status = __block_write_begin_int(page, pos, len, NULL, iomap);
> >       else
> >               status = __iomap_write_begin(inode, pos, len, page, iomap);
> > @@ -761,11 +784,11 @@ iomap_write_end_inline(struct inode *inode, struct page *page,
> >  {
> >       void *addr;
> >
> > -     WARN_ON_ONCE(!PageUptodate(page));
> > -     BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
> > +     BUG_ON(!inline_data_within_block(inode, iomap, pos + copied));
> >
> >       addr = kmap_atomic(page);
> > -     memcpy(iomap->inline_data + pos, addr + pos, copied);
> > +     memcpy(iomap->inline_data + offset_in_block(pos, inode),
> > +            addr + offset_in_page(pos), copied);
> >       kunmap_atomic(addr);
> >
> >       mark_inode_dirty(inode);
> > @@ -1064,7 +1087,7 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> >               const struct iomap_ops *ops)
> >  {
> >       unsigned int blocksize = i_blocksize(inode);
> > -     unsigned int off = pos & (blocksize - 1);
> > +     unsigned int off = offset_in_block(pos, inode);
> >
> >       /* Block boundary? Nothing to do */
> >       if (!off)
> > @@ -1772,21 +1795,26 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
> >       struct iov_iter *iter = dio->submit.iter;
> >       size_t copied;
> >
> > -     BUG_ON(pos + length > PAGE_SIZE - offset_in_page(iomap->inline_data));
> > +     BUG_ON(!inline_data_within_block(inode, iomap, pos + length));
> >
> >       if (dio->flags & IOMAP_DIO_WRITE) {
> >               loff_t size = inode->i_size;
> >
> >               if (pos > size)
> > -                     memset(iomap->inline_data + size, 0, pos - size);
> > -             copied = copy_from_iter(iomap->inline_data + pos, length, iter);
> > +                     memset(iomap->inline_data +
> > +                            offset_in_block(size, inode), 0, pos - size);
> > +             copied = copy_from_iter(iomap->inline_data +
> > +                                     offset_in_block(pos, inode),
> > +                                     length, iter);
> >               if (copied) {
> >                       if (pos + copied > size)
> >                               i_size_write(inode, pos + copied);
> >                       mark_inode_dirty(inode);
> >               }
> >       } else {
> > -             copied = copy_to_iter(iomap->inline_data + pos, length, iter);
> > +             copied = copy_to_iter(iomap->inline_data +
> > +                                   offset_in_block(pos, inode),
> > +                                   length, iter);
> >       }
> >       dio->size += copied;
> >       return copied;
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 2103b94cb1bf..a8a60dd2fdc0 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -131,6 +131,9 @@ static inline struct iomap_page *to_iomap_page(struct page *page)
> >       return NULL;
> >  }
> >
> > +void __iomap_read_inline_data(struct inode *inode, struct page *page,
> > +             struct iomap *iomap);
> > +
> >  ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
> >               const struct iomap_ops *ops);
> >  int iomap_readpage(struct page *page, const struct iomap_ops *ops);
> >
Chao Yu July 15, 2019, 9:26 a.m. UTC | #14
On 2019/7/12 19:54, Andreas Gruenbacher wrote:
> On Fri, 12 Jul 2019 at 11:31, Chao Yu <yuchao0@huawei.com> wrote:
>> On 2019/7/11 20:28, Andreas Gruenbacher wrote:
>>> Something along the lines of the attached, broken patch might work in
>>> the end.
>>>
>>> Andreas
>>>
>>> ---
>>>  fs/buffer.c           | 10 ++++--
>>>  fs/iomap.c            | 74 +++++++++++++++++++++++++++++--------------
>>>  include/linux/iomap.h |  3 ++
>>>  3 files changed, 61 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/fs/buffer.c b/fs/buffer.c
>>> index e450c55f6434..8d8668e377ab 100644
>>> --- a/fs/buffer.c
>>> +++ b/fs/buffer.c
>>> @@ -1873,8 +1873,8 @@ void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
>>>  EXPORT_SYMBOL(page_zero_new_buffers);
>>>
>>>  static void
>>> -iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
>>> -             struct iomap *iomap)
>>> +iomap_to_bh(struct inode *inode, struct page *page, sector_t block,
>>> +             struct buffer_head *bh, struct iomap *iomap)
>>>  {
>>>       loff_t offset = block << inode->i_blkbits;
>>>
>>> @@ -1924,6 +1924,10 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
>>>                               inode->i_blkbits;
>>>               set_buffer_mapped(bh);
>>>               break;
>>> +     case IOMAP_INLINE:
>>> +             __iomap_read_inline_data(inode, page, iomap);
>>> +             set_buffer_uptodate(bh);
>>> +             break;
>>>       }
>>>  }
>>>
>>> @@ -1969,7 +1973,7 @@ int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
>>>                               if (err)
>>>                                       break;
>>>                       } else {
>>> -                             iomap_to_bh(inode, block, bh, iomap);
>>> +                             iomap_to_bh(inode, page, block, bh, iomap);
>>>                       }
>>>
>>>                       if (buffer_new(bh)) {
>>> diff --git a/fs/iomap.c b/fs/iomap.c
>>> index 45aa58e837b5..61188e95def2 100644
>>> --- a/fs/iomap.c
>>> +++ b/fs/iomap.c
>>> @@ -260,24 +260,47 @@ struct iomap_readpage_ctx {
>>>       struct list_head        *pages;
>>>  };
>>>
>>> -static void
>>> -iomap_read_inline_data(struct inode *inode, struct page *page,
>>> +#define offset_in_block(offset, inode) \
>>> +     ((unsigned long)(offset) & (i_blocksize(inode) - 1))
>>> +
>>> +static bool
>>> +inline_data_within_block(struct inode *inode, struct iomap *iomap,
>>> +             unsigned int size)
>>> +{
>>> +     unsigned int off = offset_in_block(iomap->inline_data, inode);
>>> +
>>> +     return size <= i_blocksize(inode) - off;
>>> +}
>>> +
>>> +void
>>> +__iomap_read_inline_data(struct inode *inode, struct page *page,
>>>               struct iomap *iomap)
>>>  {
>>> -     size_t size = i_size_read(inode);
>>> +     size_t size = offset_in_block(i_size_read(inode), inode);
>>> +     unsigned int poff = offset_in_page(iomap->offset);
>>> +     unsigned int bsize = i_blocksize(inode);
>>>       void *addr;
>>>
>>>       if (PageUptodate(page))
>>>               return;
>>>
>>> -     BUG_ON(page->index);
>>> -     BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
>>> +     BUG_ON(!inline_data_within_block(inode, iomap, size));
>>>
>>>       addr = kmap_atomic(page);
>>> -     memcpy(addr, iomap->inline_data, size);
>>> -     memset(addr + size, 0, PAGE_SIZE - size);
>>> +     memcpy(addr + poff, iomap->inline_data, size);
>>> +     memset(addr + poff + size, 0, bsize - size);
>>>       kunmap_atomic(addr);
>>> -     SetPageUptodate(page);
>>> +}
>>> +
>>> +static void
>>> +iomap_read_inline_data(struct inode *inode, struct page *page,
>>> +             struct iomap *iomap)
>>> +{
>>> +     unsigned int poff = offset_in_page(iomap->offset);
>>> +     unsigned int bsize = i_blocksize(inode);
>>> +
>>> +     __iomap_read_inline_data(inode, page, iomap);
>>> +     iomap_set_range_uptodate(page, poff, bsize);
>>>  }
>>>
>>>  static loff_t
>>> @@ -292,11 +315,8 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>>>       unsigned poff, plen;
>>>       sector_t sector;
>>>
>>> -     if (iomap->type == IOMAP_INLINE) {
>>> -             WARN_ON_ONCE(pos);
>>> +     if (iomap->type == IOMAP_INLINE)
>>>               iomap_read_inline_data(inode, page, iomap);
>>> -             return PAGE_SIZE;
>>
>> Hi Andreas,
>>
>> Thanks for your patch.
>>
>> In my erofs test case, filled inline data will be zeroed out because we fall
>> back to the following flow:
>>
>>         if (iomap->type != IOMAP_MAPPED || pos >= i_size_read(inode)) {
>>                 zero_user(page, poff, plen);
>>
>> Should we return before this condition check?
> 
> Yes, probably by returning i_blocksize(inode) after
> iomap_read_inline_data, but that alone isn't enough to make the patch

Could you please resend this diff as a new patch with the above fix?

> work completely. This really needs a review from Christoph and careful
> testing of all the code paths.

Later, I can test the read path.
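
For concreteness, a minimal sketch of the early return being discussed, applied
to the IOMAP_INLINE branch of iomap_readpage_actor() from the draft above; as
Andreas notes, this alone is not enough to make the patch work completely, so
it is only an illustration, not a tested change:

	if (iomap->type == IOMAP_INLINE) {
		iomap_read_inline_data(inode, page, iomap);
		/*
		 * Returning the block size instead of falling through keeps
		 * the actor from reaching the zero_user() path below, which
		 * would wipe the inline data just copied into the page.
		 */
		return i_blocksize(inode);
	}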

Thanks,

> 
> Andreas
> 
>> Thanks,
>>
>>> -     }
>>>
>>>       /* zero post-eof blocks as the page may be mapped */
>>>       iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
>>> @@ -637,6 +657,11 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
>>>       if (PageUptodate(page))
>>>               return 0;
>>>
>>> +     if (iomap->type == IOMAP_INLINE) {
>>> +             iomap_read_inline_data(inode, page, iomap);
>>> +             return 0;
>>> +     }
>>> +
>>>       do {
>>>               iomap_adjust_read_range(inode, iop, &block_start,
>>>                               block_end - block_start, &poff, &plen);
>>> @@ -682,9 +707,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>>>               goto out_no_page;
>>>       }
>>>
>>> -     if (iomap->type == IOMAP_INLINE)
>>> -             iomap_read_inline_data(inode, page, iomap);
>>> -     else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
>>> +     if (iomap->flags & IOMAP_F_BUFFER_HEAD)
>>>               status = __block_write_begin_int(page, pos, len, NULL, iomap);
>>>       else
>>>               status = __iomap_write_begin(inode, pos, len, page, iomap);
>>> @@ -761,11 +784,11 @@ iomap_write_end_inline(struct inode *inode, struct page *page,
>>>  {
>>>       void *addr;
>>>
>>> -     WARN_ON_ONCE(!PageUptodate(page));
>>> -     BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
>>> +     BUG_ON(!inline_data_within_block(inode, iomap, pos + copied));
>>>
>>>       addr = kmap_atomic(page);
>>> -     memcpy(iomap->inline_data + pos, addr + pos, copied);
>>> +     memcpy(iomap->inline_data + offset_in_block(pos, inode),
>>> +            addr + offset_in_page(pos), copied);
>>>       kunmap_atomic(addr);
>>>
>>>       mark_inode_dirty(inode);
>>> @@ -1064,7 +1087,7 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
>>>               const struct iomap_ops *ops)
>>>  {
>>>       unsigned int blocksize = i_blocksize(inode);
>>> -     unsigned int off = pos & (blocksize - 1);
>>> +     unsigned int off = offset_in_block(pos, inode);
>>>
>>>       /* Block boundary? Nothing to do */
>>>       if (!off)
>>> @@ -1772,21 +1795,26 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
>>>       struct iov_iter *iter = dio->submit.iter;
>>>       size_t copied;
>>>
>>> -     BUG_ON(pos + length > PAGE_SIZE - offset_in_page(iomap->inline_data));
>>> +     BUG_ON(!inline_data_within_block(inode, iomap, pos + length));
>>>
>>>       if (dio->flags & IOMAP_DIO_WRITE) {
>>>               loff_t size = inode->i_size;
>>>
>>>               if (pos > size)
>>> -                     memset(iomap->inline_data + size, 0, pos - size);
>>> -             copied = copy_from_iter(iomap->inline_data + pos, length, iter);
>>> +                     memset(iomap->inline_data +
>>> +                            offset_in_block(size, inode), 0, pos - size);
>>> +             copied = copy_from_iter(iomap->inline_data +
>>> +                                     offset_in_block(pos, inode),
>>> +                                     length, iter);
>>>               if (copied) {
>>>                       if (pos + copied > size)
>>>                               i_size_write(inode, pos + copied);
>>>                       mark_inode_dirty(inode);
>>>               }
>>>       } else {
>>> -             copied = copy_to_iter(iomap->inline_data + pos, length, iter);
>>> +             copied = copy_to_iter(iomap->inline_data +
>>> +                                   offset_in_block(pos, inode),
>>> +                                   length, iter);
>>>       }
>>>       dio->size += copied;
>>>       return copied;
>>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>>> index 2103b94cb1bf..a8a60dd2fdc0 100644
>>> --- a/include/linux/iomap.h
>>> +++ b/include/linux/iomap.h
>>> @@ -131,6 +131,9 @@ static inline struct iomap_page *to_iomap_page(struct page *page)
>>>       return NULL;
>>>  }
>>>
>>> +void __iomap_read_inline_data(struct inode *inode, struct page *page,
>>> +             struct iomap *iomap);
>>> +
>>>  ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
>>>               const struct iomap_ops *ops);
>>>  int iomap_readpage(struct page *page, const struct iomap_ops *ops);
>>>
> .
>
Chao Yu July 17, 2019, 2:58 a.m. UTC | #15
Ping, Andreas. :P

Christoph Hellwig July 18, 2019, 12:31 p.m. UTC | #16
> +iomap_to_bh(struct inode *inode, struct page *page, sector_t block,
> +		struct buffer_head *bh, struct iomap *iomap)
>  {
>  	loff_t offset = block << inode->i_blkbits;
>  
> @@ -1924,6 +1924,10 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
>  				inode->i_blkbits;
>  		set_buffer_mapped(bh);
>  		break;
> +	case IOMAP_INLINE:
> +		__iomap_read_inline_data(inode, page, iomap);
> +		set_buffer_uptodate(bh);
> +		break;

I have to say I hate pushing this into the legacy buffer_head code.
My hope was that we could get rid of this code rather than adding to it.

The other issue is that this now calls iomap functions from buffer.c,
which I'd also really like to avoid.

That being said, until the tail-packing fs (erofs?) actually uses
buffer_heads we should not need this hunk, and I don't think erofs
has any reason to use buffer_heads.

> +#define offset_in_block(offset, inode) \
> +	((unsigned long)(offset) & (i_blocksize(inode) - 1))

Make this an inline function, please.  I think we also have a few
other places that could make use of this helper; it might even go
into fs.h.
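
A rough sketch of such an inline helper, purely to illustrate the suggestion;
the loff_t parameter and any placement in fs.h are assumptions here, and
callers that currently pass a pointer (e.g. iomap->inline_data) would need an
explicit integer cast, which is presumably why the draft used a macro:

/*
 * Sketch only: an inline-function form of the offset_in_block() macro
 * quoted above.  With this signature, callers passing iomap->inline_data
 * would have to cast the pointer explicitly.
 */
static inline unsigned int offset_in_block(loff_t pos, struct inode *inode)
{
	return pos & (i_blocksize(inode) - 1);
}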

Otherwise this looks sensible to me.
Christoph Hellwig July 18, 2019, 12:33 p.m. UTC | #17
On Fri, Jul 12, 2019 at 01:54:07PM +0200, Andreas Gruenbacher wrote:
> Yes, probably by returning i_blocksize(inode) after
> iomap_read_inline_data, but that alone isn't enough to make the patch
> work completely. This really needs a review from Christoph and careful
> testing of all the code paths.

Well, let's start with the testing.  Can we get a coherent series that
includes the erofs bits?
Gao Xiang July 18, 2019, 1:27 p.m. UTC | #18
On 2019/7/18 20:31, Christoph Hellwig wrote:
> That being said until the tail packing fs (erofs?) actually uses
> buffer_heads we should not need this hunk, and I don't think erofs
> should have any reason to use buffer_heads.

Yes, erofs will not use buffer_heads to support sub-page blocksize
in the long term; they are too heavy and there is no need...

Thanks,
Gao Xiang
diff mbox series

Patch

diff --git a/fs/iomap.c b/fs/iomap.c
index 12654c2e78f8..d1c16b692d31 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -264,13 +264,12 @@  static void
 iomap_read_inline_data(struct inode *inode, struct page *page,
 		struct iomap *iomap)
 {
-	size_t size = i_size_read(inode);
+	size_t size = iomap->length;
 	void *addr;
 
 	if (PageUptodate(page))
 		return;
 
-	BUG_ON(page->index);
 	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
 
 	addr = kmap_atomic(page);
@@ -293,7 +292,6 @@  iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	sector_t sector;
 
 	if (iomap->type == IOMAP_INLINE) {
-		WARN_ON_ONCE(pos);
 		iomap_read_inline_data(inode, page, iomap);
 		return PAGE_SIZE;
 	}
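
Since the hunks above make iomap_read_inline_data() take its size from
iomap->length and drop the page->index and pos assertions, an inline extent no
longer has to start at file offset zero.  As a purely illustrative sketch
(none of the names below come from erofs or from any posted patch), an
->iomap_begin implementation for a tail-packing filesystem might describe a
file as a mapped extent followed by an inline tail roughly like this:

/*
 * Hypothetical example only: how a tail-packing filesystem could fill in the
 * iomap once the inline extent may sit past offset zero.  The helpers
 * example_disk_addr() and example_tail_kaddr() do not exist anywhere; they
 * stand in for filesystem-specific lookups.
 */
static int example_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
		unsigned flags, struct iomap *iomap)
{
	loff_t tail_start = round_down(i_size_read(inode), i_blocksize(inode));

	if (pos < tail_start) {
		/* Block-aligned part of the file, read via the bio path. */
		iomap->type = IOMAP_MAPPED;
		iomap->offset = pos;
		iomap->length = tail_start - pos;
		iomap->addr = example_disk_addr(inode, pos);
	} else {
		/* Tail packed into metadata, copied by the inline path. */
		iomap->type = IOMAP_INLINE;
		iomap->offset = tail_start;
		iomap->length = i_size_read(inode) - tail_start;
		iomap->inline_data = example_tail_kaddr(inode);
	}
	return 0;
}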