xfs: allocate sector sized IO buffer via page_frag_alloc

Message ID	20190225040904.5557-1-ming.lei@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@kernel.org> From: Ming Lei <ming.lei@redhat.com> To: "Darrick J . Wong" <darrick.wong@oracle.com> Cc: linux-xfs@vger.kernel.org, Ming Lei <ming.lei@redhat.com>, Jens Axboe <axboe@kernel.dk>, Vitaly Kuznetsov <vkuznets@redhat.com>, Dave Chinner <dchinner@redhat.com>, Christoph Hellwig <hch@lst.de>, Alexander Duyck <alexander.h.duyck@linux.intel.com>, Aaron Lu <aaron.lu@intel.com>, Christopher Lameter <cl@linux.com>, Linux FS Devel <linux-fsdevel@vger.kernel.org>, linux-mm@kvack.org, linux-block@vger.kernel.org Subject: [PATCH] xfs: allocate sector sized IO buffer via page_frag_alloc Date: Mon, 25 Feb 2019 12:09:04 +0800 Message-Id: <20190225040904.5557-1-ming.lei@redhat.com> Sender: linux-block-owner@vger.kernel.org Precedence: bulk
Series	xfs: allocate sector sized IO buffer via page_frag_alloc \| expand xfs: allocate sector sized IO buffer via page_frag_alloc

Ming Lei Feb. 25, 2019, 4:09 a.m. UTC

XFS uses kmalloc() to allocate sector sized IO buffer.

Turns out buffer allocated via kmalloc(sector sized) can't
be guaranteed to be 512 byte aligned, and actually slab only provides
ARCH_KMALLOC_MINALIGN alignment, even though it is observed
that the sector size allocation is often 512 byte aligned. When
KASAN or other memory debug options are enabled, the allocated
buffer becomes not aliged with 512 byte any more.

This unalgined IO buffer causes at least two issues:

1) some storage controller requires IO buffer to be 512 byte aligned,
and data corruption is observed

2) loop/dio requires the IO buffer to be logical block size aligned,
and loop's default logcial block size is 512 byte, then one xfs image
can't be mounted via loop/dio any more.

Use page_frag_alloc() to allocate the sector sized buffer, then the
above issue can be fixed because offset_in_page of allocated buffer
is always sector aligned.

Not see any regression with this patch on xfstests.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-block@vger.kernel.org
Link: https://marc.info/?t=153734857500004&r=1&w=2
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 fs/xfs/xfs_buf.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

Dave Chinner Feb. 25, 2019, 4:36 a.m. UTC | #1

On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote:
> XFS uses kmalloc() to allocate sector sized IO buffer.
....
> Use page_frag_alloc() to allocate the sector sized buffer, then the
> above issue can be fixed because offset_in_page of allocated buffer
> is always sector aligned.

Didn't we already reject this approach because page frags cannot be
reused and that pages allocated to the frag pool are pinned in
memory until all fragments allocated on the page have been freed?

i.e. when we consider 64k page machines and 4k block sizes (i.e.
default config), every single metadata allocation is a sub-page
allocation and so will use this new page frag mechanism. IOWs, it
will result in fragmenting memory severely and typical memory
reclaim not being able to fix it because the metadata that pins each
page is largely unreclaimable...

Cheers,

Dave.

Ming Lei Feb. 25, 2019, 8:46 a.m. UTC | #2

On Mon, Feb 25, 2019 at 03:36:48PM +1100, Dave Chinner wrote:
> On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote:
> > XFS uses kmalloc() to allocate sector sized IO buffer.
> ....
> > Use page_frag_alloc() to allocate the sector sized buffer, then the
> > above issue can be fixed because offset_in_page of allocated buffer
> > is always sector aligned.
> 
> Didn't we already reject this approach because page frags cannot be

I remembered there is this kind of issue mentioned, but just not found
the details, so post out the patch for restarting the discussion.

> reused and that pages allocated to the frag pool are pinned in
> memory until all fragments allocated on the page have been freed?

Yes, that is one problem. But if one page is consumed, sooner or later,
all fragments will be freed, then the page becomes available again.

> 
> i.e. when we consider 64k page machines and 4k block sizes (i.e.
> default config), every single metadata allocation is a sub-page
> allocation and so will use this new page frag mechanism. IOWs, it
> will result in fragmenting memory severely and typical memory
> reclaim not being able to fix it because the metadata that pins each
> page is largely unreclaimable...

It can be an issue in case of IO timeout & retry.


Thanks,
Ming

Ming Lei Feb. 25, 2019, 10:03 a.m. UTC | #3

On Mon, Feb 25, 2019 at 04:46:25PM +0800, Ming Lei wrote:
> On Mon, Feb 25, 2019 at 03:36:48PM +1100, Dave Chinner wrote:
> > On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote:
> > > XFS uses kmalloc() to allocate sector sized IO buffer.
> > ....
> > > Use page_frag_alloc() to allocate the sector sized buffer, then the
> > > above issue can be fixed because offset_in_page of allocated buffer
> > > is always sector aligned.
> > 
> > Didn't we already reject this approach because page frags cannot be
> 
> I remembered there is this kind of issue mentioned, but just not found
> the details, so post out the patch for restarting the discussion.
> 
> > reused and that pages allocated to the frag pool are pinned in
> > memory until all fragments allocated on the page have been freed?
> 
> Yes, that is one problem. But if one page is consumed, sooner or later,
> all fragments will be freed, then the page becomes available again.
> 
> > 
> > i.e. when we consider 64k page machines and 4k block sizes (i.e.
> > default config), every single metadata allocation is a sub-page
> > allocation and so will use this new page frag mechanism. IOWs, it
> > will result in fragmenting memory severely and typical memory
> > reclaim not being able to fix it because the metadata that pins each
> > page is largely unreclaimable...
> 
> It can be an issue in case of IO timeout & retry.

The worst case is still not worse than allocating single page for sub-page
IO, which should be used on other file systems under the same situation,
I guess.

thanks,
Ming

Vlastimil Babka Feb. 25, 2019, 1:15 p.m. UTC | #4

On 2/25/19 5:36 AM, Dave Chinner wrote:
> On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote:
>> XFS uses kmalloc() to allocate sector sized IO buffer.
> ....
>> Use page_frag_alloc() to allocate the sector sized buffer, then the
>> above issue can be fixed because offset_in_page of allocated buffer
>> is always sector aligned.
> 
> Didn't we already reject this approach because page frags cannot be
> reused and that pages allocated to the frag pool are pinned in
> memory until all fragments allocated on the page have been freed?

I don't know if you did, but it's certainly true., Also I don't think
there's any specified alignment guarantee for page_frag_alloc().

What about kmem_cache_create() with align parameter? That *should* be
guaranteed regardless of whatever debugging is enabled - if not, I would
consider it a bug.

> i.e. when we consider 64k page machines and 4k block sizes (i.e.
> default config), every single metadata allocation is a sub-page
> allocation and so will use this new page frag mechanism. IOWs, it
> will result in fragmenting memory severely and typical memory
> reclaim not being able to fix it because the metadata that pins each
> page is largely unreclaimable...
> 
> Cheers,
> 
> Dave.
>

Dave Chinner Feb. 25, 2019, 8:11 p.m. UTC | #5

On Mon, Feb 25, 2019 at 04:46:25PM +0800, Ming Lei wrote:
> On Mon, Feb 25, 2019 at 03:36:48PM +1100, Dave Chinner wrote:
> > On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote:
> > > XFS uses kmalloc() to allocate sector sized IO buffer.
> > ....
> > > Use page_frag_alloc() to allocate the sector sized buffer, then the
> > > above issue can be fixed because offset_in_page of allocated buffer
> > > is always sector aligned.
> > 
> > Didn't we already reject this approach because page frags cannot be
> 
> I remembered there is this kind of issue mentioned, but just not found
> the details, so post out the patch for restarting the discussion.

As previously discussed, the only solution that fits all use cases
we have to support are a slab caches that do not break object
alignment when slab debug options are turned on.

> > reused and that pages allocated to the frag pool are pinned in
> > memory until all fragments allocated on the page have been freed?
> 
> Yes, that is one problem. But if one page is consumed, sooner or later,
> all fragments will be freed, then the page becomes available again.

Ah, no, your assumption about how metadata caching in XFS works is
flawed. Some metadata ends up being cached for the life of the
filesystem because it is so frequently referenced it never gets
reclaimed. AG headers, btree root blocks, etc.  And the XFS metadata
cache hangs on to such metadata even under extreme memory pressure
because if we reclaim it then any filesystem operation will need to
reallocate that memory to clean dirty pages and that is the very
last thing we want to do under extreme memory pressure conditions.

If allocation cannot reuse holes in pages (i.e. works as a proper
slab cache) then we are going to blow out the amount of memory that
the XFS metadata cache uses very badly on filesystems where block
size != page size. 

> > i.e. when we consider 64k page machines and 4k block sizes (i.e.
> > default config), every single metadata allocation is a sub-page
> > allocation and so will use this new page frag mechanism. IOWs, it
> > will result in fragmenting memory severely and typical memory
> > reclaim not being able to fix it because the metadata that pins each
> > page is largely unreclaimable...
> 
> It can be an issue in case of IO timeout & retry.

This makes no sense to me. Exactly how does filesystem memory
allocation affect IO timeouts and any retries the filesystem might
issue?

Cheers,

Dave.

Dave Chinner Feb. 25, 2019, 8:26 p.m. UTC | #6

On Mon, Feb 25, 2019 at 02:15:59PM +0100, Vlastimil Babka wrote:
> On 2/25/19 5:36 AM, Dave Chinner wrote:
> > On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote:
> >> XFS uses kmalloc() to allocate sector sized IO buffer.
> > ....
> >> Use page_frag_alloc() to allocate the sector sized buffer, then the
> >> above issue can be fixed because offset_in_page of allocated buffer
> >> is always sector aligned.
> > 
> > Didn't we already reject this approach because page frags cannot be
> > reused and that pages allocated to the frag pool are pinned in
> > memory until all fragments allocated on the page have been freed?
> 
> I don't know if you did, but it's certainly true., Also I don't think
> there's any specified alignment guarantee for page_frag_alloc().

We did, and the alignment guarantee would have come from all
fragments having an aligned size.

> What about kmem_cache_create() with align parameter? That *should* be
> guaranteed regardless of whatever debugging is enabled - if not, I would
> consider it a bug.

Yup, that's pretty much what was decided. The sticking point was
whether is should be block layer infrastructure (because the actual
memory buffer alignment is a block/device driver requirement not
visible to the filesystem) or whether "sector size alignement is
good enough for everyone".

Cheers,

Dave.

Ming Lei Feb. 26, 2019, 2:22 a.m. UTC | #7

On Tue, Feb 26, 2019 at 07:26:30AM +1100, Dave Chinner wrote:
> On Mon, Feb 25, 2019 at 02:15:59PM +0100, Vlastimil Babka wrote:
> > On 2/25/19 5:36 AM, Dave Chinner wrote:
> > > On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote:
> > >> XFS uses kmalloc() to allocate sector sized IO buffer.
> > > ....
> > >> Use page_frag_alloc() to allocate the sector sized buffer, then the
> > >> above issue can be fixed because offset_in_page of allocated buffer
> > >> is always sector aligned.
> > > 
> > > Didn't we already reject this approach because page frags cannot be
> > > reused and that pages allocated to the frag pool are pinned in
> > > memory until all fragments allocated on the page have been freed?
> > 
> > I don't know if you did, but it's certainly true., Also I don't think
> > there's any specified alignment guarantee for page_frag_alloc().
> 
> We did, and the alignment guarantee would have come from all
> fragments having an aligned size.
> 
> > What about kmem_cache_create() with align parameter? That *should* be
> > guaranteed regardless of whatever debugging is enabled - if not, I would
> > consider it a bug.
> 
> Yup, that's pretty much what was decided. The sticking point was
> whether is should be block layer infrastructure (because the actual
> memory buffer alignment is a block/device driver requirement not
> visible to the filesystem) or whether "sector size alignement is
> good enough for everyone".

OK, looks I miss the long life time of meta data caching, then let's
discuss the slab approach.

Looks one single slab cache doesn't work, given the size may be 512 * N
(1 <= N < PAGE_SIZE/512), that is basically what I posted the first
time.

https://marc.info/?t=153986884900007&r=1&w=2
https://marc.info/?t=153986885100001&r=1&w=2

Or what is the exact size of sub-page IO in xfs most of time? For
example, if 99% times falls in 512 byte allocation, maybe it is enough
to just maintain one 512byte slab.

Thanks,
Ming

Dave Chinner Feb. 26, 2019, 3:02 a.m. UTC | #8

On Tue, Feb 26, 2019 at 10:22:50AM +0800, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 07:26:30AM +1100, Dave Chinner wrote:
> > On Mon, Feb 25, 2019 at 02:15:59PM +0100, Vlastimil Babka wrote:
> > > On 2/25/19 5:36 AM, Dave Chinner wrote:
> > > > On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote:
> > > >> XFS uses kmalloc() to allocate sector sized IO buffer.
> > > > ....
> > > >> Use page_frag_alloc() to allocate the sector sized buffer, then the
> > > >> above issue can be fixed because offset_in_page of allocated buffer
> > > >> is always sector aligned.
> > > > 
> > > > Didn't we already reject this approach because page frags cannot be
> > > > reused and that pages allocated to the frag pool are pinned in
> > > > memory until all fragments allocated on the page have been freed?
> > > 
> > > I don't know if you did, but it's certainly true., Also I don't think
> > > there's any specified alignment guarantee for page_frag_alloc().
> > 
> > We did, and the alignment guarantee would have come from all
> > fragments having an aligned size.
> > 
> > > What about kmem_cache_create() with align parameter? That *should* be
> > > guaranteed regardless of whatever debugging is enabled - if not, I would
> > > consider it a bug.
> > 
> > Yup, that's pretty much what was decided. The sticking point was
> > whether is should be block layer infrastructure (because the actual
> > memory buffer alignment is a block/device driver requirement not
> > visible to the filesystem) or whether "sector size alignement is
> > good enough for everyone".
> 
> OK, looks I miss the long life time of meta data caching, then let's
> discuss the slab approach.
> 
> Looks one single slab cache doesn't work, given the size may be 512 * N
> (1 <= N < PAGE_SIZE/512), that is basically what I posted the first
> time.
> 
> https://marc.info/?t=153986884900007&r=1&w=2
> https://marc.info/?t=153986885100001&r=1&w=2
> 
> Or what is the exact size of sub-page IO in xfs most of time? For

Determined by mkfs parameters. Any power of 2 between 512 bytes and
64kB needs to be supported. e.g:

# mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k ....

will have metadata that is sector sized (512 bytes), filesystem
block sized (1k), directory block sized (8k) and inode cluster sized
(32k), and will use all of them in large quantities.

> example, if 99% times falls in 512 byte allocation, maybe it is enough
> to just maintain one 512byte slab.

It is not. On a 64k page size machine, we use sub page slabs for
metadata blocks of 2^N bytes where 9 <= N <= 15..

Cheers,

Dave.

Matthew Wilcox Feb. 26, 2019, 3:27 a.m. UTC | #9

On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote:
> > Or what is the exact size of sub-page IO in xfs most of time? For
> 
> Determined by mkfs parameters. Any power of 2 between 512 bytes and
> 64kB needs to be supported. e.g:
> 
> # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k ....
> 
> will have metadata that is sector sized (512 bytes), filesystem
> block sized (1k), directory block sized (8k) and inode cluster sized
> (32k), and will use all of them in large quantities.

If XFS is going to use each of these in large quantities, then it doesn't
seem unreasonable for XFS to create a slab for each type of metadata?

Dave Chinner Feb. 26, 2019, 4:58 a.m. UTC | #10

On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote:
> On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote:
> > > Or what is the exact size of sub-page IO in xfs most of time? For
> > 
> > Determined by mkfs parameters. Any power of 2 between 512 bytes and
> > 64kB needs to be supported. e.g:
> > 
> > # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k ....
> > 
> > will have metadata that is sector sized (512 bytes), filesystem
> > block sized (1k), directory block sized (8k) and inode cluster sized
> > (32k), and will use all of them in large quantities.
> 
> If XFS is going to use each of these in large quantities, then it doesn't
> seem unreasonable for XFS to create a slab for each type of metadata?

Well, that is the question, isn't it? How many other filesystems
will want to make similar "don't use entire pages just for 4k of
metadata" optimisations as 64k page size machines become more
common? There are others that have the same "use slab for sector
aligned IO" which will fall foul of the same problem that has been
reported for XFS....

If nobody else cares/wants it, then it can be XFS only. But it's
only fair we address the "will it be useful to others" question
first.....

-Dave.

Ming Lei Feb. 26, 2019, 9:33 a.m. UTC | #11

On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote:
> On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote:
> > On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote:
> > > > Or what is the exact size of sub-page IO in xfs most of time? For
> > > 
> > > Determined by mkfs parameters. Any power of 2 between 512 bytes and
> > > 64kB needs to be supported. e.g:
> > > 
> > > # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k ....
> > > 
> > > will have metadata that is sector sized (512 bytes), filesystem
> > > block sized (1k), directory block sized (8k) and inode cluster sized
> > > (32k), and will use all of them in large quantities.
> > 
> > If XFS is going to use each of these in large quantities, then it doesn't
> > seem unreasonable for XFS to create a slab for each type of metadata?
> 
> 
> Well, that is the question, isn't it? How many other filesystems
> will want to make similar "don't use entire pages just for 4k of
> metadata" optimisations as 64k page size machines become more
> common? There are others that have the same "use slab for sector
> aligned IO" which will fall foul of the same problem that has been
> reported for XFS....
> 
> If nobody else cares/wants it, then it can be XFS only. But it's
> only fair we address the "will it be useful to others" question
> first.....

This kind of slab cache should have been global, just like interface of
kmalloc(size).

However, the alignment requirement depends on block device's block size,
then it becomes hard to implement as genera interface, for example:

	block size: 512, 1024, 2048, 4096
	slab size: 512*N, 0 < N < PAGE_SIZE/512

For 4k page size, 28(7*4) slabs need to be created, and 64k page size
needs to create 127*4 slabs.

But, specific file system may only use some of them, and it depends
on meta data size.

Thanks,
Ming

Vlastimil Babka Feb. 26, 2019, 10:06 a.m. UTC | #12

On 2/26/19 10:33 AM, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote:
>> On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote:
>>> On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote:
>>>>> Or what is the exact size of sub-page IO in xfs most of time? For
>>>>
>>>> Determined by mkfs parameters. Any power of 2 between 512 bytes and
>>>> 64kB needs to be supported. e.g:
>>>>
>>>> # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k ....
>>>>
>>>> will have metadata that is sector sized (512 bytes), filesystem
>>>> block sized (1k), directory block sized (8k) and inode cluster sized
>>>> (32k), and will use all of them in large quantities.
>>>
>>> If XFS is going to use each of these in large quantities, then it doesn't
>>> seem unreasonable for XFS to create a slab for each type of metadata?
>>
>>
>> Well, that is the question, isn't it? How many other filesystems
>> will want to make similar "don't use entire pages just for 4k of
>> metadata" optimisations as 64k page size machines become more
>> common? There are others that have the same "use slab for sector
>> aligned IO" which will fall foul of the same problem that has been
>> reported for XFS....
>>
>> If nobody else cares/wants it, then it can be XFS only. But it's
>> only fair we address the "will it be useful to others" question
>> first.....
> 
> This kind of slab cache should have been global, just like interface of
> kmalloc(size).
> 
> However, the alignment requirement depends on block device's block size,
> then it becomes hard to implement as genera interface, for example:
> 
> 	block size: 512, 1024, 2048, 4096
> 	slab size: 512*N, 0 < N < PAGE_SIZE/512
> 
> For 4k page size, 28(7*4) slabs need to be created, and 64k page size
> needs to create 127*4 slabs.
>

Where does the '*4' multiplier come from?

So I wonder how hard would it actually be (+CC slab maintainers) to just
guarantee generic kmalloc() alignment for power-of-two sizes. If we can
do that for kmem_cache_create() then the code should be already there.
AFAIK the alignment happens anyway (albeit not guaranteed) in the
non-debug cases, and if guaranteeing alignment for certain debugging
configurations (that need some space before the object) means larger
memory overhead, then the cost should still be bearable since its
non-standard configuration where the point is to catch bug and not have
peak performance?

> But, specific file system may only use some of them, and it depends
> on meta data size.
> 
> Thanks,
> Ming
>

Ming Lei Feb. 26, 2019, 11:12 a.m. UTC | #13

On Tue, Feb 26, 2019 at 6:07 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/26/19 10:33 AM, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote:
> >> On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote:
> >>> On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote:
> >>>>> Or what is the exact size of sub-page IO in xfs most of time? For
> >>>>
> >>>> Determined by mkfs parameters. Any power of 2 between 512 bytes and
> >>>> 64kB needs to be supported. e.g:
> >>>>
> >>>> # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k ....
> >>>>
> >>>> will have metadata that is sector sized (512 bytes), filesystem
> >>>> block sized (1k), directory block sized (8k) and inode cluster sized
> >>>> (32k), and will use all of them in large quantities.
> >>>
> >>> If XFS is going to use each of these in large quantities, then it doesn't
> >>> seem unreasonable for XFS to create a slab for each type of metadata?
> >>
> >>
> >> Well, that is the question, isn't it? How many other filesystems
> >> will want to make similar "don't use entire pages just for 4k of
> >> metadata" optimisations as 64k page size machines become more
> >> common? There are others that have the same "use slab for sector
> >> aligned IO" which will fall foul of the same problem that has been
> >> reported for XFS....
> >>
> >> If nobody else cares/wants it, then it can be XFS only. But it's
> >> only fair we address the "will it be useful to others" question
> >> first.....
> >
> > This kind of slab cache should have been global, just like interface of
> > kmalloc(size).
> >
> > However, the alignment requirement depends on block device's block size,
> > then it becomes hard to implement as genera interface, for example:
> >
> >       block size: 512, 1024, 2048, 4096
> >       slab size: 512*N, 0 < N < PAGE_SIZE/512
> >
> > For 4k page size, 28(7*4) slabs need to be created, and 64k page size
> > needs to create 127*4 slabs.
> >
>
> Where does the '*4' multiplier come from?

The buffer needs to be device block size aligned for dio, and now the block
size can be 512, 1024, 2048 and 4096.

Thanks,
Ming Lei

Matthew Wilcox Feb. 26, 2019, 12:12 p.m. UTC | #14

On Tue, Feb 26, 2019 at 07:12:49PM +0800, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 6:07 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > On 2/26/19 10:33 AM, Ming Lei wrote:
> > > On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote:
> > >> On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote:
> > >>> On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote:
> > >>>>> Or what is the exact size of sub-page IO in xfs most of time? For
> > >>>>
> > >>>> Determined by mkfs parameters. Any power of 2 between 512 bytes and
> > >>>> 64kB needs to be supported. e.g:
> > >>>>
> > >>>> # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k ....
> > >>>>
> > >>>> will have metadata that is sector sized (512 bytes), filesystem
> > >>>> block sized (1k), directory block sized (8k) and inode cluster sized
> > >>>> (32k), and will use all of them in large quantities.
> > >>>
> > >>> If XFS is going to use each of these in large quantities, then it doesn't
> > >>> seem unreasonable for XFS to create a slab for each type of metadata?
> > >>
> > >>
> > >> Well, that is the question, isn't it? How many other filesystems
> > >> will want to make similar "don't use entire pages just for 4k of
> > >> metadata" optimisations as 64k page size machines become more
> > >> common? There are others that have the same "use slab for sector
> > >> aligned IO" which will fall foul of the same problem that has been
> > >> reported for XFS....
> > >>
> > >> If nobody else cares/wants it, then it can be XFS only. But it's
> > >> only fair we address the "will it be useful to others" question
> > >> first.....
> > >
> > > This kind of slab cache should have been global, just like interface of
> > > kmalloc(size).
> > >
> > > However, the alignment requirement depends on block device's block size,
> > > then it becomes hard to implement as genera interface, for example:
> > >
> > >       block size: 512, 1024, 2048, 4096
> > >       slab size: 512*N, 0 < N < PAGE_SIZE/512
> > >
> > > For 4k page size, 28(7*4) slabs need to be created, and 64k page size
> > > needs to create 127*4 slabs.
> > >
> >
> > Where does the '*4' multiplier come from?
> 
> The buffer needs to be device block size aligned for dio, and now the block
> size can be 512, 1024, 2048 and 4096.

Why does the block size make a difference?  This requirement is due to
some storage devices having shoddy DMA controllers.  Are you saying there
are devices which can't even do 512-byte aligned I/O?

Ming Lei Feb. 26, 2019, 12:35 p.m. UTC | #15

On Tue, Feb 26, 2019 at 04:12:09AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 26, 2019 at 07:12:49PM +0800, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 6:07 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > On 2/26/19 10:33 AM, Ming Lei wrote:
> > > > On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote:
> > > >> On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote:
> > > >>> On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote:
> > > >>>>> Or what is the exact size of sub-page IO in xfs most of time? For
> > > >>>>
> > > >>>> Determined by mkfs parameters. Any power of 2 between 512 bytes and
> > > >>>> 64kB needs to be supported. e.g:
> > > >>>>
> > > >>>> # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k ....
> > > >>>>
> > > >>>> will have metadata that is sector sized (512 bytes), filesystem
> > > >>>> block sized (1k), directory block sized (8k) and inode cluster sized
> > > >>>> (32k), and will use all of them in large quantities.
> > > >>>
> > > >>> If XFS is going to use each of these in large quantities, then it doesn't
> > > >>> seem unreasonable for XFS to create a slab for each type of metadata?
> > > >>
> > > >>
> > > >> Well, that is the question, isn't it? How many other filesystems
> > > >> will want to make similar "don't use entire pages just for 4k of
> > > >> metadata" optimisations as 64k page size machines become more
> > > >> common? There are others that have the same "use slab for sector
> > > >> aligned IO" which will fall foul of the same problem that has been
> > > >> reported for XFS....
> > > >>
> > > >> If nobody else cares/wants it, then it can be XFS only. But it's
> > > >> only fair we address the "will it be useful to others" question
> > > >> first.....
> > > >
> > > > This kind of slab cache should have been global, just like interface of
> > > > kmalloc(size).
> > > >
> > > > However, the alignment requirement depends on block device's block size,
> > > > then it becomes hard to implement as genera interface, for example:
> > > >
> > > >       block size: 512, 1024, 2048, 4096
> > > >       slab size: 512*N, 0 < N < PAGE_SIZE/512
> > > >
> > > > For 4k page size, 28(7*4) slabs need to be created, and 64k page size
> > > > needs to create 127*4 slabs.
> > > >
> > >
> > > Where does the '*4' multiplier come from?
> > 
> > The buffer needs to be device block size aligned for dio, and now the block
> > size can be 512, 1024, 2048 and 4096.
> 
> Why does the block size make a difference?  This requirement is due to
> some storage devices having shoddy DMA controllers.  Are you saying there
> are devices which can't even do 512-byte aligned I/O?

Direct IO requires that, see do_blockdev_direct_IO().

This issue can be triggered when running xfs over loop/dio. We could
fallback to buffered IO under this situation, but not sure it is the
only case.


Thanks,
Ming

Matthew Wilcox Feb. 26, 2019, 1:02 p.m. UTC | #16

On Tue, Feb 26, 2019 at 08:35:46PM +0800, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 04:12:09AM -0800, Matthew Wilcox wrote:
> > On Tue, Feb 26, 2019 at 07:12:49PM +0800, Ming Lei wrote:
> > > The buffer needs to be device block size aligned for dio, and now the block
> > > size can be 512, 1024, 2048 and 4096.
> > 
> > Why does the block size make a difference?  This requirement is due to
> > some storage devices having shoddy DMA controllers.  Are you saying there
> > are devices which can't even do 512-byte aligned I/O?
> 
> Direct IO requires that, see do_blockdev_direct_IO().
> 
> This issue can be triggered when running xfs over loop/dio. We could
> fallback to buffered IO under this situation, but not sure it is the
> only case.

Wait, we're imposing a ridiculous amount of complexity on XFS for no
reason at all?  We should just change this to 512-byte alignment.  Tying
it to the blocksize of the device never made any sense.

Ming Lei Feb. 26, 2019, 1:42 p.m. UTC | #17

On Tue, Feb 26, 2019 at 05:02:30AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 26, 2019 at 08:35:46PM +0800, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 04:12:09AM -0800, Matthew Wilcox wrote:
> > > On Tue, Feb 26, 2019 at 07:12:49PM +0800, Ming Lei wrote:
> > > > The buffer needs to be device block size aligned for dio, and now the block
> > > > size can be 512, 1024, 2048 and 4096.
> > > 
> > > Why does the block size make a difference?  This requirement is due to
> > > some storage devices having shoddy DMA controllers.  Are you saying there
> > > are devices which can't even do 512-byte aligned I/O?
> > 
> > Direct IO requires that, see do_blockdev_direct_IO().
> > 
> > This issue can be triggered when running xfs over loop/dio. We could
> > fallback to buffered IO under this situation, but not sure it is the
> > only case.
> 
> Wait, we're imposing a ridiculous amount of complexity on XFS for no
> reason at all?  We should just change this to 512-byte alignment.  Tying
> it to the blocksize of the device never made any sense.

OK, that is fine since we can fallback to buffered IO for loop in case of
unaligned dio.

Then something like the following patch should work for all fs, could
anyone comment on this approach?

--

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..76f09f23a410 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -405,3 +405,44 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	return ret;
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+static struct kmem_cache *sector_buf_slabs[(PAGE_SIZE >> 9) - 1];
+
+void *blk_alloc_sec_buf(unsigned size, gfp_t flags)
+{
+	int idx;
+
+	size = round_up(size, 512);
+	if (size >= PAGE_SIZE)
+		return NULL;
+
+	idx = (size >> 9) - 1;
+	if (!sector_buf_slabs[idx])
+		return NULL;
+	return kmem_cache_alloc(sector_buf_slabs[idx], flags);
+}
+EXPORT_SYMBOL_GPL(blk_alloc_sec_buf);
+
+void blk_free_sec_buf(void *buf, int size)
+{
+	size = round_up(size, 512);
+	if (size >= PAGE_SIZE)
+		return;
+
+	return kmem_cache_free(sector_buf_slabs[(size >> 9) - 1], buf);
+}
+EXPORT_SYMBOL_GPL(blk_free_sec_buf);
+
+void __init blk_sector_buf_init(void)
+{
+	unsigned size;
+
+	for (size = 512; size < PAGE_SIZE; size += 512) {
+		char name[16];
+		int idx = (size >> 9) - 1;
+
+		snprintf(name, 16, "blk_sec_buf-%u", size);
+		sector_buf_slabs[idx] = kmem_cache_create(name, size, 512,
+							  SLAB_PANIC, NULL);
+	}
+}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index faed9d9eb84c..a4117e526715 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1657,6 +1657,9 @@ extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
 
+extern void *blk_alloc_sec_buf(unsigned size, gfp_t flags);
+extern void blk_free_sec_buf(void *buf, int size);
+
 #ifdef CONFIG_BLK_DEV_ZONED
 bool blk_req_needs_zone_write_lock(struct request *rq);
 void __blk_req_zone_write_lock(struct request *rq);
@@ -1755,6 +1758,15 @@ static inline int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 	return 0;
 }
 
+static inline void *blk_alloc_sec_buf(unsigned size, gfp_t flags)
+{
+	return NULL;
+}
+
+static inline void blk_free_sec_buf(void *buf, int size)
+{
+}
+
 #endif /* CONFIG_BLOCK */
 
 static inline void blk_wake_io_task(struct task_struct *waiter)

Thanks,
Ming

Matthew Wilcox Feb. 26, 2019, 2:04 p.m. UTC | #18

On Tue, Feb 26, 2019 at 09:42:48PM +0800, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 05:02:30AM -0800, Matthew Wilcox wrote:
> > Wait, we're imposing a ridiculous amount of complexity on XFS for no
> > reason at all?  We should just change this to 512-byte alignment.  Tying
> > it to the blocksize of the device never made any sense.
> 
> OK, that is fine since we can fallback to buffered IO for loop in case of
> unaligned dio.
> 
> Then something like the following patch should work for all fs, could
> anyone comment on this approach?

That's not even close to what I meant.

diff --git a/fs/direct-io.c b/fs/direct-io.c
index ec2fb6fe6d37..dee1fc47a7fc 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1185,18 +1185,20 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	struct dio_submit sdio = { 0, };
 	struct buffer_head map_bh = { 0, };
 	struct blk_plug plug;
-	unsigned long align = offset | iov_iter_alignment(iter);
 
 	/*
 	 * Avoid references to bdev if not absolutely needed to give
 	 * the early prefetch in the caller enough time.
 	 */
 
-	if (align & blocksize_mask) {
+	if (iov_iter_alignment(iter) & 511)
+		goto out;
+
+	if (offset & blocksize_mask) {
 		if (bdev)
 			blkbits = blksize_bits(bdev_logical_block_size(bdev));
 		blocksize_mask = (1 << blkbits) - 1;
-		if (align & blocksize_mask)
+		if (offset & blocksize_mask)
 			goto out;
 	}

Christoph Lameter (Ampere) Feb. 26, 2019, 3:20 p.m. UTC | #19

On Mon, 25 Feb 2019, Vlastimil Babka wrote:

> What about kmem_cache_create() with align parameter? That *should* be
> guaranteed regardless of whatever debugging is enabled - if not, I would
> consider it a bug.

It definitely guarantees that. What would be the point of the alignment
parameter otherwise?

Christoph Lameter (Ampere) Feb. 26, 2019, 3:30 p.m. UTC | #20

On Tue, 26 Feb 2019, Ming Lei wrote:

> Then something like the following patch should work for all fs, could
> anyone comment on this approach?

Note that various subsystems have similar implementations. Have a look at

drivers/dma/dmaengine.c

struct dmaengine_unmap_pool {
        struct kmem_cache *cache;
        const char *name;
        mempool_t *pool;
        size_t size;
};

#define __UNMAP_POOL(x) { .size = x, .name = "dmaengine-unmap-"
__stringify(x) }
static struct dmaengine_unmap_pool unmap_pool[] = {
        __UNMAP_POOL(2),
        #if IS_ENABLED(CONFIG_DMA_ENGINE_RAID)
        __UNMAP_POOL(16),
        __UNMAP_POOL(128),
        __UNMAP_POOL(256),
        #endif
};

Or drivers/md/dm-bufio.c:

struct dm_bufio_client {
        struct mutex lock;

        struct list_head lru[LIST_SIZE];
        unsigned long n_buffers[LIST_SIZE];

        struct block_device *bdev;
        unsigned block_size;
        s8 sectors_per_block_bits;
        void (*alloc_callback)(struct dm_buffer *);
        void (*write_callback)(struct dm_buffer *);

        struct kmem_cache *slab_buffer;
        struct kmem_cache *slab_cache;
        struct dm_io_client *dm_io;

        struct list_head reserved_buffers;
        unsigned need_reserved_buffers;

        unsigned minimum_buffers;

        struct rb_root buffer_tree;
        wait_queue_head_t free_buffer_wait;

        sector_t start;

        int async_write_error;

        struct list_head client_list;
        struct shrinker shrinker;
};

Darrick J. Wong Feb. 26, 2019, 4:14 p.m. UTC | #21

On Tue, Feb 26, 2019 at 06:04:40AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 26, 2019 at 09:42:48PM +0800, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 05:02:30AM -0800, Matthew Wilcox wrote:
> > > Wait, we're imposing a ridiculous amount of complexity on XFS for no
> > > reason at all?  We should just change this to 512-byte alignment.  Tying
> > > it to the blocksize of the device never made any sense.
> > 
> > OK, that is fine since we can fallback to buffered IO for loop in case of
> > unaligned dio.
> > 
> > Then something like the following patch should work for all fs, could
> > anyone comment on this approach?
> 
> That's not even close to what I meant.
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index ec2fb6fe6d37..dee1fc47a7fc 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1185,18 +1185,20 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,

Wait a minute, are you all saying that /directio/ is broken on XFS too??
XFS doesn't use blockdev_direct_IO anymore.

I thought we were talking about alignment of XFS metadata buffers
(xfs_buf.c), which is a very different topic.

As I understand the problem, in non-debug mode the slab caches give
xfs_buf chunks of memory that are aligned well enough to work, but in
debug mode the slabs allocate slightly more bytes to carry debug
information which pushes the returned address up slightly, thus breaking
the alignment requirements.

So why can't we just move the debug info to the end of the object?  If
our 512 byte allocation turns into a (512 + a few more) bytes we'll end
up using 1024 bytes on the allocation regardless, so it shouldn't matter
to put the debug info at offset 512.  If the reason is fear that kernel
code will scribble off the end of the object, then return (*obj + 512).
Maybe you all have already covered this, though?

--D

>  	struct dio_submit sdio = { 0, };
>  	struct buffer_head map_bh = { 0, };
>  	struct blk_plug plug;
> -	unsigned long align = offset | iov_iter_alignment(iter);
>  
>  	/*
>  	 * Avoid references to bdev if not absolutely needed to give
>  	 * the early prefetch in the caller enough time.
>  	 */
>  
> -	if (align & blocksize_mask) {
> +	if (iov_iter_alignment(iter) & 511)
> +		goto out;
> +
> +	if (offset & blocksize_mask) {
>  		if (bdev)
>  			blkbits = blksize_bits(bdev_logical_block_size(bdev));
>  		blocksize_mask = (1 << blkbits) - 1;
> -		if (align & blocksize_mask)
> +		if (offset & blocksize_mask)
>  			goto out;
>  	}
>

Matthew Wilcox Feb. 26, 2019, 4:19 p.m. UTC | #22

On Tue, Feb 26, 2019 at 08:14:33AM -0800, Darrick J. Wong wrote:
> On Tue, Feb 26, 2019 at 06:04:40AM -0800, Matthew Wilcox wrote:
> > On Tue, Feb 26, 2019 at 09:42:48PM +0800, Ming Lei wrote:
> > > On Tue, Feb 26, 2019 at 05:02:30AM -0800, Matthew Wilcox wrote:
> > > > Wait, we're imposing a ridiculous amount of complexity on XFS for no
> > > > reason at all?  We should just change this to 512-byte alignment.  Tying
> > > > it to the blocksize of the device never made any sense.
> > > 
> > > OK, that is fine since we can fallback to buffered IO for loop in case of
> > > unaligned dio.
> > > 
> > > Then something like the following patch should work for all fs, could
> > > anyone comment on this approach?
> > 
> > That's not even close to what I meant.
> > 
> > diff --git a/fs/direct-io.c b/fs/direct-io.c
> > index ec2fb6fe6d37..dee1fc47a7fc 100644
> > --- a/fs/direct-io.c
> > +++ b/fs/direct-io.c
> > @@ -1185,18 +1185,20 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
> 
> Wait a minute, are you all saying that /directio/ is broken on XFS too??
> XFS doesn't use blockdev_direct_IO anymore.
> 
> I thought we were talking about alignment of XFS metadata buffers
> (xfs_buf.c), which is a very different topic.
> 
> As I understand the problem, in non-debug mode the slab caches give
> xfs_buf chunks of memory that are aligned well enough to work, but in
> debug mode the slabs allocate slightly more bytes to carry debug
> information which pushes the returned address up slightly, thus breaking
> the alignment requirements.
> 
> So why can't we just move the debug info to the end of the object?  If
> our 512 byte allocation turns into a (512 + a few more) bytes we'll end
> up using 1024 bytes on the allocation regardless, so it shouldn't matter
> to put the debug info at offset 512.  If the reason is fear that kernel
> code will scribble off the end of the object, then return (*obj + 512).
> Maybe you all have already covered this, though?

I don't know _what_ Ming Lei is saying.  I thought the problem was
with slab redzones, which need to be before and after each object,
but apparently the problem is with KASAN as well.

Dave Chinner Feb. 26, 2019, 8:45 p.m. UTC | #23

On Tue, Feb 26, 2019 at 05:33:04PM +0800, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote:
> > On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote:
> > > On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote:
> > > > > Or what is the exact size of sub-page IO in xfs most of time? For
> > > > 
> > > > Determined by mkfs parameters. Any power of 2 between 512 bytes and
> > > > 64kB needs to be supported. e.g:
> > > > 
> > > > # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k ....
> > > > 
> > > > will have metadata that is sector sized (512 bytes), filesystem
> > > > block sized (1k), directory block sized (8k) and inode cluster sized
> > > > (32k), and will use all of them in large quantities.
> > > 
> > > If XFS is going to use each of these in large quantities, then it doesn't
> > > seem unreasonable for XFS to create a slab for each type of metadata?
> > 
> > 
> > Well, that is the question, isn't it? How many other filesystems
> > will want to make similar "don't use entire pages just for 4k of
> > metadata" optimisations as 64k page size machines become more
> > common? There are others that have the same "use slab for sector
> > aligned IO" which will fall foul of the same problem that has been
> > reported for XFS....
> > 
> > If nobody else cares/wants it, then it can be XFS only. But it's
> > only fair we address the "will it be useful to others" question
> > first.....
> 
> This kind of slab cache should have been global, just like interface of
> kmalloc(size).
> 
> However, the alignment requirement depends on block device's block size,
> then it becomes hard to implement as genera interface, for example:
> 
> 	block size: 512, 1024, 2048, 4096
> 	slab size: 512*N, 0 < N < PAGE_SIZE/512
> 
> For 4k page size, 28(7*4) slabs need to be created, and 64k page size
> needs to create 127*4 slabs.

IDGI. Where's the 7/127 come from?

We only require sector alignment at most, so as long as each slab
object is aligned to it's size, we only need one slab for each block
size.

Cheers,

Dave.

Ming Lei Feb. 27, 2019, 1:41 a.m. UTC | #24

On Tue, Feb 26, 2019 at 08:19:12AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 26, 2019 at 08:14:33AM -0800, Darrick J. Wong wrote:
> > On Tue, Feb 26, 2019 at 06:04:40AM -0800, Matthew Wilcox wrote:
> > > On Tue, Feb 26, 2019 at 09:42:48PM +0800, Ming Lei wrote:
> > > > On Tue, Feb 26, 2019 at 05:02:30AM -0800, Matthew Wilcox wrote:
> > > > > Wait, we're imposing a ridiculous amount of complexity on XFS for no
> > > > > reason at all?  We should just change this to 512-byte alignment.  Tying
> > > > > it to the blocksize of the device never made any sense.
> > > > 
> > > > OK, that is fine since we can fallback to buffered IO for loop in case of
> > > > unaligned dio.
> > > > 
> > > > Then something like the following patch should work for all fs, could
> > > > anyone comment on this approach?
> > > 
> > > That's not even close to what I meant.
> > > 
> > > diff --git a/fs/direct-io.c b/fs/direct-io.c
> > > index ec2fb6fe6d37..dee1fc47a7fc 100644
> > > --- a/fs/direct-io.c
> > > +++ b/fs/direct-io.c
> > > @@ -1185,18 +1185,20 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
> > 
> > Wait a minute, are you all saying that /directio/ is broken on XFS too??
> > XFS doesn't use blockdev_direct_IO anymore.
> > 
> > I thought we were talking about alignment of XFS metadata buffers
> > (xfs_buf.c), which is a very different topic.
> > 
> > As I understand the problem, in non-debug mode the slab caches give
> > xfs_buf chunks of memory that are aligned well enough to work, but in
> > debug mode the slabs allocate slightly more bytes to carry debug
> > information which pushes the returned address up slightly, thus breaking
> > the alignment requirements.
> > 
> > So why can't we just move the debug info to the end of the object?  If
> > our 512 byte allocation turns into a (512 + a few more) bytes we'll end
> > up using 1024 bytes on the allocation regardless, so it shouldn't matter
> > to put the debug info at offset 512.  If the reason is fear that kernel
> > code will scribble off the end of the object, then return (*obj + 512).
> > Maybe you all have already covered this, though?
> 
> I don't know _what_ Ming Lei is saying.  I thought the problem was
> with slab redzones, which need to be before and after each object,
> but apparently the problem is with KASAN as well.

I have mentioned several times that it is triggered on xfs over
loop/dio, however it may be addressed by falling back to buffered IO
in case unaligned buffer.

Please see lo_rw_aio().

Thanks,
Ming

Ming Lei Feb. 27, 2019, 1:50 a.m. UTC | #25

On Wed, Feb 27, 2019 at 07:45:50AM +1100, Dave Chinner wrote:
> On Tue, Feb 26, 2019 at 05:33:04PM +0800, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote:
> > > On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote:
> > > > On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote:
> > > > > > Or what is the exact size of sub-page IO in xfs most of time? For
> > > > > 
> > > > > Determined by mkfs parameters. Any power of 2 between 512 bytes and
> > > > > 64kB needs to be supported. e.g:
> > > > > 
> > > > > # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k ....
> > > > > 
> > > > > will have metadata that is sector sized (512 bytes), filesystem
> > > > > block sized (1k), directory block sized (8k) and inode cluster sized
> > > > > (32k), and will use all of them in large quantities.
> > > > 
> > > > If XFS is going to use each of these in large quantities, then it doesn't
> > > > seem unreasonable for XFS to create a slab for each type of metadata?
> > > 
> > > 
> > > Well, that is the question, isn't it? How many other filesystems
> > > will want to make similar "don't use entire pages just for 4k of
> > > metadata" optimisations as 64k page size machines become more
> > > common? There are others that have the same "use slab for sector
> > > aligned IO" which will fall foul of the same problem that has been
> > > reported for XFS....
> > > 
> > > If nobody else cares/wants it, then it can be XFS only. But it's
> > > only fair we address the "will it be useful to others" question
> > > first.....
> > 
> > This kind of slab cache should have been global, just like interface of
> > kmalloc(size).
> > 
> > However, the alignment requirement depends on block device's block size,
> > then it becomes hard to implement as genera interface, for example:
> > 
> > 	block size: 512, 1024, 2048, 4096
> > 	slab size: 512*N, 0 < N < PAGE_SIZE/512
> > 
> > For 4k page size, 28(7*4) slabs need to be created, and 64k page size
> > needs to create 127*4 slabs.
> 
> IDGI. Where's the 7/127 come from?
> 
> We only require sector alignment at most, so as long as each slab
> object is aligned to it's size, we only need one slab for each block
> size.

Each slab has fixed size, I remembered that you mentioned that the meta
data size can be 512 * N (1 <= N <= PAGE_SIZE / 512).

https://marc.info/?l=linux-fsdevel&m=155115014513355&w=2


Thanks, 
Ming

Dave Chinner Feb. 27, 2019, 3:41 a.m. UTC | #26

On Wed, Feb 27, 2019 at 09:50:55AM +0800, Ming Lei wrote:
> On Wed, Feb 27, 2019 at 07:45:50AM +1100, Dave Chinner wrote:
> > On Tue, Feb 26, 2019 at 05:33:04PM +0800, Ming Lei wrote:
> > > On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote:
> > > > On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote:
> > > > > On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote:
> > > > > > > Or what is the exact size of sub-page IO in xfs most of time? For
> > > > > > 
> > > > > > Determined by mkfs parameters. Any power of 2 between 512 bytes and
> > > > > > 64kB needs to be supported. e.g:
> > > > > > 
> > > > > > # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k ....
> > > > > > 
> > > > > > will have metadata that is sector sized (512 bytes), filesystem
> > > > > > block sized (1k), directory block sized (8k) and inode cluster sized
> > > > > > (32k), and will use all of them in large quantities.
> > > > > 
> > > > > If XFS is going to use each of these in large quantities, then it doesn't
> > > > > seem unreasonable for XFS to create a slab for each type of metadata?
> > > > 
> > > > 
> > > > Well, that is the question, isn't it? How many other filesystems
> > > > will want to make similar "don't use entire pages just for 4k of
> > > > metadata" optimisations as 64k page size machines become more
> > > > common? There are others that have the same "use slab for sector
> > > > aligned IO" which will fall foul of the same problem that has been
> > > > reported for XFS....
> > > > 
> > > > If nobody else cares/wants it, then it can be XFS only. But it's
> > > > only fair we address the "will it be useful to others" question
> > > > first.....
> > > 
> > > This kind of slab cache should have been global, just like interface of
> > > kmalloc(size).
> > > 
> > > However, the alignment requirement depends on block device's block size,
> > > then it becomes hard to implement as genera interface, for example:
> > > 
> > > 	block size: 512, 1024, 2048, 4096
> > > 	slab size: 512*N, 0 < N < PAGE_SIZE/512
> > > 
> > > For 4k page size, 28(7*4) slabs need to be created, and 64k page size
> > > needs to create 127*4 slabs.
> > 
> > IDGI. Where's the 7/127 come from?
> > 
> > We only require sector alignment at most, so as long as each slab
> > object is aligned to it's size, we only need one slab for each block
> > size.
> 
> Each slab has fixed size, I remembered that you mentioned that the meta
> data size can be 512 * N (1 <= N <= PAGE_SIZE / 512).
> 
> https://marc.info/?l=linux-fsdevel&m=155115014513355&w=2

nggggh. 

*That* *is* *not* *what* *I* *said*.

That is *what you said* and I said that was wrong and the actual
sizes needed were:

dgc> It is not. On a 64k page size machine, we use sub page slabs for
dgc> metadata blocks of 2^N bytes where 9 <= N <= 15..

Please do the maths. I shoul dnot have to do it for you.

Also, please don't confusing sector-in-LBA alignment with /memory
buffer/ alignment. i.e. you're assuming that these 2kB IOs at
different sector offsets like the following:

 0	0.5	1.0	1.5	2.0	2.5	3.0	3.5	4.0
 +-------+-------+-------+-------+-------+-------+-------+-------+
 +ooooooooooooooooooooooooooooooo+
	 +ooooooooooooooooooooooooooooooo+
		 +ooooooooooooooooooooooooooooooo+
			 +ooooooooooooooooooooooooooooooo+
				 +ooooooooooooooooooooooooooooooo+

require a memory buffer alignment that matches the sector-in-LBA
alignment. This is wrong - the memory alignment constraint is
usually a hardware DMA engine limitation and has nothing to do with
the LBA the IO will place/retreive the data in/from.

i.e. An aligned 2kB slab will only allocate 2kB slab objects like so:

 0	0.5	1.0	1.5	2.0	2.5	3.0	3.5	4.0
 +-------+-------+-------+-------+-------+-------+-------+-------+
 +ooooooooooooooooooooooooooooooo+
				 +ooooooooooooooooooooooooooooooo+

and these memory buffers will always have 512 byte alignment. This
meets the memory alignemnt requirements of all hardware (which is
512 byte alignment or smaller), and so we only need one slab per
size.

Cheers,

Dave.

Vlastimil Babka Feb. 27, 2019, 7:07 a.m. UTC | #27

On 2/26/19 5:19 PM, Matthew Wilcox wrote:
> On Tue, Feb 26, 2019 at 08:14:33AM -0800, Darrick J. Wong wrote:
>> On Tue, Feb 26, 2019 at 06:04:40AM -0800, Matthew Wilcox wrote:
>> Wait a minute, are you all saying that /directio/ is broken on XFS too??
>> XFS doesn't use blockdev_direct_IO anymore.
>>
>> I thought we were talking about alignment of XFS metadata buffers
>> (xfs_buf.c), which is a very different topic.
>>
>> As I understand the problem, in non-debug mode the slab caches give
>> xfs_buf chunks of memory that are aligned well enough to work, but in
>> debug mode the slabs allocate slightly more bytes to carry debug
>> information which pushes the returned address up slightly, thus breaking
>> the alignment requirements.
>>
>> So why can't we just move the debug info to the end of the object?  If
>> our 512 byte allocation turns into a (512 + a few more) bytes we'll end
>> up using 1024 bytes on the allocation regardless, so it shouldn't matter
>> to put the debug info at offset 512.  If the reason is fear that kernel
>> code will scribble off the end of the object, then return (*obj + 512).
>> Maybe you all have already covered this, though?
> 
> I don't know _what_ Ming Lei is saying.  I thought the problem was
> with slab redzones, which need to be before and after each object,
> but apparently the problem is with KASAN as well.

That's what I thought as well. But if we can solve it for caches created
by kmem_cache_create(..., align, ...) then IMHO we could guarantee
natural alignment for power-of-two kmalloc caches as well.

Dave Chinner Feb. 27, 2019, 9:38 p.m. UTC | #28

On Tue, Feb 26, 2019 at 08:14:33AM -0800, Darrick J. Wong wrote:
> On Tue, Feb 26, 2019 at 06:04:40AM -0800, Matthew Wilcox wrote:
> > On Tue, Feb 26, 2019 at 09:42:48PM +0800, Ming Lei wrote:
> > > On Tue, Feb 26, 2019 at 05:02:30AM -0800, Matthew Wilcox wrote:
> > > > Wait, we're imposing a ridiculous amount of complexity on XFS for no
> > > > reason at all?  We should just change this to 512-byte alignment.  Tying
> > > > it to the blocksize of the device never made any sense.
> > > 
> > > OK, that is fine since we can fallback to buffered IO for loop in case of
> > > unaligned dio.
> > > 
> > > Then something like the following patch should work for all fs, could
> > > anyone comment on this approach?
> > 
> > That's not even close to what I meant.
> > 
> > diff --git a/fs/direct-io.c b/fs/direct-io.c
> > index ec2fb6fe6d37..dee1fc47a7fc 100644
> > --- a/fs/direct-io.c
> > +++ b/fs/direct-io.c
> > @@ -1185,18 +1185,20 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
> 
> Wait a minute, are you all saying that /directio/ is broken on XFS too??
> XFS doesn't use blockdev_direct_IO anymore.

No, loop devices w/ direct IO is a complete red herring. It's the
same issue - the upper filesystem is sending down unaligned bios
to the loop device, which is then just passing them on to the
underlying filesystem via DIO, and the DIO code in the lower
filesystem saying "that user memory buffer ain't aligned to my
logical sector size" and rejecting it.

Actually, in XFS's case, it doesn't care about memory buffer
alignment - it's the iomap code that is rejecting it when mapping
the memory buffer to a bio in iomap_dio_bio_actor():

	unsigned int blkbits = blksize_bits(bdev_logical_block_size(iomap->bdev));
.....
	unsigned int align = iov_iter_alignment(dio->submit.iter);
.....
	if ((pos | length | align) & ((1 << blkbits) - 1))
		return -EINVAL;

IOWs, if the memory buffer is not aligned to the logical block size
of the underlying device (which defaults to 512 bytes) then it will
be rejected by the lower filesystem...

> I thought we were talking about alignment of XFS metadata buffers
> (xfs_buf.c), which is a very different topic.

Yup, it's the same problem, just a different symptom. Fix the
unaligned buffer problem, we fix the loop dev w/ direct-io problem,
too.

FWIW, this "filesystem image sector size mismatch" problem has been
around for many, many years - the same issue that occurs with loop
devices when you try to mount a 512 byte sector image on a hard 4k
sector host filesystem/storage device using direct IO in the loop
device. This isn't a new thing at all - if you want to use direct IO
to manipulate filesystem images, you actually need to know what you
are doing....

Cheers,

Dave.

Christoph Hellwig March 8, 2019, 8:18 a.m. UTC | #29

On Wed, Feb 27, 2019 at 08:07:31AM +0100, Vlastimil Babka wrote:
> > I don't know _what_ Ming Lei is saying.  I thought the problem was
> > with slab redzones, which need to be before and after each object,
> > but apparently the problem is with KASAN as well.
> 
> That's what I thought as well. But if we can solve it for caches created
> by kmem_cache_create(..., align, ...) then IMHO we could guarantee
> natural alignment for power-of-two kmalloc caches as well.

Yes, having a version of kmalloc that guarantees power of two alignment
would be extremely helpful and avoid a lot of pointless boilerplate code.

xfs: allocate sector sized IO buffer via page_frag_alloc

Commit Message

Comments

Patch