diff mbox series

[RFC] xfs: re-enable FIBMAP on reflink; disable for swap

Message ID 2eb759e5-2faa-67fd-5c16-c1d8edc42d02@redhat.com (mailing list archive)
State Superseded
Headers show
Series [RFC] xfs: re-enable FIBMAP on reflink; disable for swap | expand

Commit Message

Eric Sandeen Aug. 30, 2018, 4:10 p.m. UTC
We disabled FIBMAP/bmap calls for reflinked files because swap will then
writes directly to the blocks, bypassing COW mechanisms, and breaking
copy on write.  As noted in commit db1327b, this restriction also breaks
bootloaders that want to use the FIBMAP ioctl.

Rather than disabling the entire mapping interface for everyone just
because swapon may abuse the info, teach xfs_iomap_swapfile_activate()
to reject activation for reflinked files, and re-enable the FIBMAP
interface.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

yes, other FIBMAP users could also abuse this information, but
disallowing mapping because somebody might do something dumb with
it seems very heavy handed... thoughts?

Comments

Christoph Hellwig Aug. 30, 2018, 4:25 p.m. UTC | #1
On Thu, Aug 30, 2018 at 11:10:05AM -0500, Eric Sandeen wrote:
> We disabled FIBMAP/bmap calls for reflinked files because swap will then
> writes directly to the blocks, bypassing COW mechanisms, and breaking
> copy on write.  As noted in commit db1327b, this restriction also breaks
> bootloaders that want to use the FIBMAP ioctl.
> 
> Rather than disabling the entire mapping interface for everyone just
> because swapon may abuse the info, teach xfs_iomap_swapfile_activate()
> to reject activation for reflinked files, and re-enable the FIBMAP
> interface.

Every use of the feature is an abuse, and that includes bootloaders.
By the time you've obtained the information it can (and often will)
be stale.
Eric Sandeen Aug. 30, 2018, 4:31 p.m. UTC | #2
On 8/30/18 11:25 AM, Christoph Hellwig wrote:
> On Thu, Aug 30, 2018 at 11:10:05AM -0500, Eric Sandeen wrote:
>> We disabled FIBMAP/bmap calls for reflinked files because swap will then
>> writes directly to the blocks, bypassing COW mechanisms, and breaking
>> copy on write.  As noted in commit db1327b, this restriction also breaks
>> bootloaders that want to use the FIBMAP ioctl.
>>
>> Rather than disabling the entire mapping interface for everyone just
>> because swapon may abuse the info, teach xfs_iomap_swapfile_activate()
>> to reject activation for reflinked files, and re-enable the FIBMAP
>> interface.
> 
> Every use of the feature is an abuse, and that includes bootloaders.
> By the time you've obtained the information it can (and often will)
> be stale.
> 

That's no reason to uniquely disallow it for reflinked files, though;
the problem is universal.  It's true for fiemap as well.  So I'm not sure
that's an argument against the patch?

-Eric
Eric Sandeen Aug. 30, 2018, 4:35 p.m. UTC | #3
On 8/30/18 11:36 AM, Christoph Hellwig wrote:
> On Thu, Aug 30, 2018 at 11:31:40AM -0500, Eric Sandeen wrote:
>> That's no reason to uniquely disallow it for reflinked files, though;
>> the problem is universal.  It's true for fiemap as well.  So I'm not sure
>> that's an argument against the patch?
> 
> fiemap at least tells you an extent is shared, bmap does not.

yes, so bmap is clearly the wrong interface to use if you want to
write directly to a file's blocks.  But if you know enough to check
the fiemap shared flag, you know enough to not use fibmap for that purpose...

-Eric
Christoph Hellwig Aug. 30, 2018, 4:36 p.m. UTC | #4
On Thu, Aug 30, 2018 at 11:31:40AM -0500, Eric Sandeen wrote:
> That's no reason to uniquely disallow it for reflinked files, though;
> the problem is universal.  It's true for fiemap as well.  So I'm not sure
> that's an argument against the patch?

fiemap at least tells you an extent is shared, bmap does not.
Brian Foster Aug. 30, 2018, 6:02 p.m. UTC | #5
On Thu, Aug 30, 2018 at 11:35:46AM -0500, Eric Sandeen wrote:
> On 8/30/18 11:36 AM, Christoph Hellwig wrote:
> > On Thu, Aug 30, 2018 at 11:31:40AM -0500, Eric Sandeen wrote:
> >> That's no reason to uniquely disallow it for reflinked files, though;
> >> the problem is universal.  It's true for fiemap as well.  So I'm not sure
> >> that's an argument against the patch?
> > 
> > fiemap at least tells you an extent is shared, bmap does not.
> 
> yes, so bmap is clearly the wrong interface to use if you want to
> write directly to a file's blocks.  But if you know enough to check
> the fiemap shared flag, you know enough to not use fibmap for that purpose...
> 

FWIW, this patch seems reasonable to me. To Christoph's point, I don't
think either interface really grants license to write to the underlying
blocks, so either way it's technically being abused for this purpose.
Unless there's a clear way to return an error for a particular type of
file, I think it's reasonable behavior for fibmap to expose the data it
supports (i.e., block maps) and drop the data it doesn't (reflink
state).

Brian

> -Eric
Darrick J. Wong Aug. 30, 2018, 6:28 p.m. UTC | #6
On Thu, Aug 30, 2018 at 02:02:05PM -0400, Brian Foster wrote:
> On Thu, Aug 30, 2018 at 11:35:46AM -0500, Eric Sandeen wrote:
> > On 8/30/18 11:36 AM, Christoph Hellwig wrote:
> > > On Thu, Aug 30, 2018 at 11:31:40AM -0500, Eric Sandeen wrote:
> > >> That's no reason to uniquely disallow it for reflinked files, though;
> > >> the problem is universal.  It's true for fiemap as well.  So I'm not sure
> > >> that's an argument against the patch?
> > > 
> > > fiemap at least tells you an extent is shared, bmap does not.
> > 
> > yes, so bmap is clearly the wrong interface to use if you want to
> > write directly to a file's blocks.  But if you know enough to check
> > the fiemap shared flag, you know enough to not use fibmap for that purpose...
> > 
> 
> FWIW, this patch seems reasonable to me. To Christoph's point, I don't
> think either interface really grants license to write to the underlying
> blocks, so either way it's technically being abused for this purpose.
> Unless there's a clear way to return an error for a particular type of
> file, I think it's reasonable behavior for fibmap to expose the data it
> supports (i.e., block maps) and drop the data it doesn't (reflink
> state).

But shared block status isn't something that can be dropped lightly.  If
you write to a shared block without realizing it, you'll corrupt every
other file that shares the block.

I prefer to have FIBMAP return errors to *cough* encourage people to use
FIEMAP.  If code are going to abuse the FI[BE]MAP interface they could
at least abuse the one that gives it enough context to avoid fs
corruption.  (A proper fs driver would be preferable, though very
difficult).

Granted, grub's blocklist code doesn't seem to check for shared blocks
when it writes grubenv.... yuck, though TBH I don't have the eye budget
to spend on digging through grub2.  Frankly I think FIBMAP comes verrry
close to "this API is unfixably stupid and shouldn't be enabled for new
use cases and should go away some day".

--D

> Brian
> 
> > -Eric
Eric Sandeen Aug. 30, 2018, 6:51 p.m. UTC | #7
On 8/30/18 1:28 PM, Darrick J. Wong wrote:
> On Thu, Aug 30, 2018 at 02:02:05PM -0400, Brian Foster wrote:
>> On Thu, Aug 30, 2018 at 11:35:46AM -0500, Eric Sandeen wrote:
>>> On 8/30/18 11:36 AM, Christoph Hellwig wrote:
>>>> On Thu, Aug 30, 2018 at 11:31:40AM -0500, Eric Sandeen wrote:
>>>>> That's no reason to uniquely disallow it for reflinked files, though;
>>>>> the problem is universal.  It's true for fiemap as well.  So I'm not sure
>>>>> that's an argument against the patch?
>>>>
>>>> fiemap at least tells you an extent is shared, bmap does not.
>>>
>>> yes, so bmap is clearly the wrong interface to use if you want to
>>> write directly to a file's blocks.  But if you know enough to check
>>> the fiemap shared flag, you know enough to not use fibmap for that purpose...
>>>
>>
>> FWIW, this patch seems reasonable to me. To Christoph's point, I don't
>> think either interface really grants license to write to the underlying
>> blocks, so either way it's technically being abused for this purpose.
>> Unless there's a clear way to return an error for a particular type of
>> file, I think it's reasonable behavior for fibmap to expose the data it
>> supports (i.e., block maps) and drop the data it doesn't (reflink
>> state).
> 
> But shared block status isn't something that can be dropped lightly.  If
> you write to a shared block without realizing it, you'll corrupt every
> other file that shares the block.

But there is no circumstance under which it is safe to write to a mapped
block no matter how you mapped it, tbh.  This is just singling out one case
of many, and it seems capricious to me.  Other than the blast zone being
possibly larger for reflinked files ... but I just don't think that's our
judgement to make here.

> I prefer to have FIBMAP return errors to *cough* encourage people to use
> FIEMAP.  If code are going to abuse the FI[BE]MAP interface they could
> at least abuse the one that gives it enough context to avoid fs
> corruption.  (A proper fs driver would be preferable, though very
> difficult).
> 
> Granted, grub's blocklist code doesn't seem to check for shared blocks
> when it writes grubenv.... yuck, though TBH I don't have the eye budget
> to spend on digging through grub2.

grub2 doesn't even use fiemap or fibmap so I'm not sure it's relevant
to this decision...

> Frankly I think FIBMAP comes verrry
> close to "this API is unfixably stupid and shouldn't be enabled for new
> use cases and should go away some day".
So instead if anyone asks we'll just give them a successful response which
is indistinguishable from a hole.  :(

I'm pretty strongly of the opinion that we should give the user what
they ask for, and not try to intuit their motives for the question.

-Eric
Brian Foster Aug. 30, 2018, 7:39 p.m. UTC | #8
On Thu, Aug 30, 2018 at 01:51:56PM -0500, Eric Sandeen wrote:
> On 8/30/18 1:28 PM, Darrick J. Wong wrote:
> > On Thu, Aug 30, 2018 at 02:02:05PM -0400, Brian Foster wrote:
> >> On Thu, Aug 30, 2018 at 11:35:46AM -0500, Eric Sandeen wrote:
> >>> On 8/30/18 11:36 AM, Christoph Hellwig wrote:
> >>>> On Thu, Aug 30, 2018 at 11:31:40AM -0500, Eric Sandeen wrote:
> >>>>> That's no reason to uniquely disallow it for reflinked files, though;
> >>>>> the problem is universal.  It's true for fiemap as well.  So I'm not sure
> >>>>> that's an argument against the patch?
> >>>>
> >>>> fiemap at least tells you an extent is shared, bmap does not.
> >>>
> >>> yes, so bmap is clearly the wrong interface to use if you want to
> >>> write directly to a file's blocks.  But if you know enough to check
> >>> the fiemap shared flag, you know enough to not use fibmap for that purpose...
> >>>
> >>
> >> FWIW, this patch seems reasonable to me. To Christoph's point, I don't
> >> think either interface really grants license to write to the underlying
> >> blocks, so either way it's technically being abused for this purpose.
> >> Unless there's a clear way to return an error for a particular type of
> >> file, I think it's reasonable behavior for fibmap to expose the data it
> >> supports (i.e., block maps) and drop the data it doesn't (reflink
> >> state).
> > 
> > But shared block status isn't something that can be dropped lightly.  If
> > you write to a shared block without realizing it, you'll corrupt every
> > other file that shares the block.
> 
> But there is no circumstance under which it is safe to write to a mapped
> block no matter how you mapped it, tbh.  This is just singling out one case
> of many, and it seems capricious to me.  Other than the blast zone being
> possibly larger for reflinked files ... but I just don't think that's our
> judgement to make here.
> 
> > I prefer to have FIBMAP return errors to *cough* encourage people to use
> > FIEMAP.  If code are going to abuse the FI[BE]MAP interface they could
> > at least abuse the one that gives it enough context to avoid fs
> > corruption.  (A proper fs driver would be preferable, though very
> > difficult).
> > 

I think that's a reasonable option as well..

> > Granted, grub's blocklist code doesn't seem to check for shared blocks
> > when it writes grubenv.... yuck, though TBH I don't have the eye budget
> > to spend on digging through grub2.
> 
> grub2 doesn't even use fiemap or fibmap so I'm not sure it's relevant
> to this decision...
> 
> > Frankly I think FIBMAP comes verrry
> > close to "this API is unfixably stupid and shouldn't be enabled for new
> > use cases and should go away some day".
> So instead if anyone asks we'll just give them a successful response which
> is indistinguishable from a hole.  :(
> 

... but this seems to be the crux of the matter (IMO, at least). If we
can return -ENOTSUPP or whatever, then it can be made obvious that the
user either needs to use fiemap or avoid using reflinked files. ISTM
that what we do now is essentially report an incorrect bmap, which leads
to these subtle bug reports.

I haven't dug into the fibmap code.. does something prevent returning a
legitimate error code?

Brian

> I'm pretty strongly of the opinion that we should give the user what
> they ask for, and not try to intuit their motives for the question.
> 
> -Eric
Eric Sandeen Aug. 30, 2018, 7:47 p.m. UTC | #9
On 8/30/18 2:39 PM, Brian Foster wrote:
>>> Frankly I think FIBMAP comes verrry
>>> close to "this API is unfixably stupid and shouldn't be enabled for new
>>> use cases and should go away some day".

Backing up, the interface isn't /that/ dumb, other than being limited to
32 bits.  Q: "Where is this logical file block physically located?"  A: "It is here."

Inefficient, sure.

>> So instead if anyone asks we'll just give them a successful response which
>> is indistinguishable from a hole.  :(
>>
> ... but this seems to be the crux of the matter (IMO, at least). If we
> can return -ENOTSUPP or whatever, then it can be made obvious that the
> user either needs to use fiemap or avoid using reflinked files. ISTM
> that what we do now is essentially report an incorrect bmap, which leads
> to these subtle bug reports.
> 
> I haven't dug into the fibmap code.. does something prevent returning a
> legitimate error code?

so ->bmap() returns a sector_t, which is a u64 and doesn't really
leave room for a negative error return.  But the ioctl interface 
stuffs that return into an int, and returns it to the user.
(note to self, wonder why it doesn't return -EOVERFLOW as needed).

I suppose ->bmap() could be changed to take a u64 in/out pointer like the
user interface does, and return 0 or -errno.  *Shrug* this seems like
a lot of work to avoid simply giving the user what they asked
for.  ;)

-Eric
Brian Foster Aug. 30, 2018, 7:58 p.m. UTC | #10
On Thu, Aug 30, 2018 at 02:47:09PM -0500, Eric Sandeen wrote:
> On 8/30/18 2:39 PM, Brian Foster wrote:
> >>> Frankly I think FIBMAP comes verrry
> >>> close to "this API is unfixably stupid and shouldn't be enabled for new
> >>> use cases and should go away some day".
> 
> Backing up, the interface isn't /that/ dumb, other than being limited to
> 32 bits.  Q: "Where is this logical file block physically located?"  A: "It is here."
> 
> Inefficient, sure.
> 
> >> So instead if anyone asks we'll just give them a successful response which
> >> is indistinguishable from a hole.  :(
> >>
> > ... but this seems to be the crux of the matter (IMO, at least). If we
> > can return -ENOTSUPP or whatever, then it can be made obvious that the
> > user either needs to use fiemap or avoid using reflinked files. ISTM
> > that what we do now is essentially report an incorrect bmap, which leads
> > to these subtle bug reports.
> > 
> > I haven't dug into the fibmap code.. does something prevent returning a
> > legitimate error code?
> 
> so ->bmap() returns a sector_t, which is a u64 and doesn't really
> leave room for a negative error return.  But the ioctl interface 
> stuffs that return into an int, and returns it to the user.
> (note to self, wonder why it doesn't return -EOVERFLOW as needed).
> 

Ah, so the syscall can return an error, but the internal interface
basically doesn't support error propagation from the fs callback.

> I suppose ->bmap() could be changed to take a u64 in/out pointer like the
> user interface does, and return 0 or -errno.  *Shrug* this seems like
> a lot of work to avoid simply giving the user what they asked
> for.  ;)
> 

FWIW, another option might be to drop the ->bmap() callback on reflinked
inodes, which looks like it would trigger an -EINVAL to userspace. I
also don't know if that is worth whatever trouble might be required to
change function pointers like that, if it can even be done safely. I
guess my take boils down to that I think either returning an error (one
way or another) or accurate block map info is more sane than the current
behavior.

Brian

> -Eric
Dave Chinner Aug. 31, 2018, 12:11 a.m. UTC | #11
On Thu, Aug 30, 2018 at 01:51:56PM -0500, Eric Sandeen wrote:
> On 8/30/18 1:28 PM, Darrick J. Wong wrote:
> > On Thu, Aug 30, 2018 at 02:02:05PM -0400, Brian Foster wrote:
> >> On Thu, Aug 30, 2018 at 11:35:46AM -0500, Eric Sandeen wrote:
> >>> On 8/30/18 11:36 AM, Christoph Hellwig wrote:
> >>>> On Thu, Aug 30, 2018 at 11:31:40AM -0500, Eric Sandeen wrote:
> >>>>> That's no reason to uniquely disallow it for reflinked files, though;
> >>>>> the problem is universal.  It's true for fiemap as well.  So I'm not sure
> >>>>> that's an argument against the patch?
> >>>>
> >>>> fiemap at least tells you an extent is shared, bmap does not.
> >>>
> >>> yes, so bmap is clearly the wrong interface to use if you want to
> >>> write directly to a file's blocks.  But if you know enough to check
> >>> the fiemap shared flag, you know enough to not use fibmap for that purpose...
> >>>
> >>
> >> FWIW, this patch seems reasonable to me. To Christoph's point, I don't
> >> think either interface really grants license to write to the underlying
> >> blocks, so either way it's technically being abused for this purpose.
> >> Unless there's a clear way to return an error for a particular type of
> >> file, I think it's reasonable behavior for fibmap to expose the data it
> >> supports (i.e., block maps) and drop the data it doesn't (reflink
> >> state).
> > 
> > But shared block status isn't something that can be dropped lightly.  If
> > you write to a shared block without realizing it, you'll corrupt every
> > other file that shares the block.
> 
> But there is no circumstance under which it is safe to write to a mapped
> block no matter how you mapped it, tbh.

<sigh>

That's what all the break_layouts() code in XFS provides. It's a
mechanism for applications to prevent the block layout from changing
unexpected until they - the layout lease owner - give up their
exclusive access to the file layout.

Seriously, this has been talked about so much in the past year or
two in the context of DAX, RDMA, get_user_pages() races in direct
IO, etc. it pains me to see this discussion rehashing it all over
again.

We want applications to do what they need to do safely.  FIBMAP is
unsafe and, worse, it's unfixable. We need to get apps to move away
from it to something is actualayl safe.

Adding a file lease interface to block 3rd party changes to the
file layout until the app releases the lease is a safe way
of allowing userspace apps to use FIEMAP to map and identify
file extents they can write directly to if they need to.

IOWs, we need to get the FL_LAYOUT flag out into the external file
lease interface (IIRC Dan Williams posted patches for this a while
back) and get these "FIBMAP + write()" apps to use "FL_LAYOUT,
fsync(), FIEMAP, write(), ~FL_LAYOUT".

We need to make FIBMAP go away by providing a safer, more robust
solution to the problem people are trying to solve.

-Dave.
Eric Sandeen Aug. 31, 2018, 1:34 a.m. UTC | #12
On 8/30/18 7:11 PM, Dave Chinner wrote:
> On Thu, Aug 30, 2018 at 01:51:56PM -0500, Eric Sandeen wrote:
>> On 8/30/18 1:28 PM, Darrick J. Wong wrote:
>>> On Thu, Aug 30, 2018 at 02:02:05PM -0400, Brian Foster wrote:
>>>> On Thu, Aug 30, 2018 at 11:35:46AM -0500, Eric Sandeen wrote:
>>>>> On 8/30/18 11:36 AM, Christoph Hellwig wrote:
>>>>>> On Thu, Aug 30, 2018 at 11:31:40AM -0500, Eric Sandeen wrote:
>>>>>>> That's no reason to uniquely disallow it for reflinked files, though;
>>>>>>> the problem is universal.  It's true for fiemap as well.  So I'm not sure
>>>>>>> that's an argument against the patch?
>>>>>>
>>>>>> fiemap at least tells you an extent is shared, bmap does not.
>>>>>
>>>>> yes, so bmap is clearly the wrong interface to use if you want to
>>>>> write directly to a file's blocks.  But if you know enough to check
>>>>> the fiemap shared flag, you know enough to not use fibmap for that purpose...
>>>>>
>>>>
>>>> FWIW, this patch seems reasonable to me. To Christoph's point, I don't
>>>> think either interface really grants license to write to the underlying
>>>> blocks, so either way it's technically being abused for this purpose.
>>>> Unless there's a clear way to return an error for a particular type of
>>>> file, I think it's reasonable behavior for fibmap to expose the data it
>>>> supports (i.e., block maps) and drop the data it doesn't (reflink
>>>> state).
>>>
>>> But shared block status isn't something that can be dropped lightly.  If
>>> you write to a shared block without realizing it, you'll corrupt every
>>> other file that shares the block.
>>
>> But there is no circumstance under which it is safe to write to a mapped
>> block no matter how you mapped it, tbh.
> 
> <sigh>
> 
> That's what all the break_layouts() code in XFS provides. It's a
> mechanism for applications to prevent the block layout from changing
> unexpected until they - the layout lease owner - give up their
> exclusive access to the file layout.

> Seriously, this has been talked about so much in the past year or
> two in the context of DAX, RDMA, get_user_pages() races in direct
> IO, etc. it pains me to see this discussion rehashing it all over
> again.
> 
> We want applications to do what they need to do safely.  FIBMAP is
> unsafe and, worse, it's unfixable. We need to get apps to move away
> from it to something is actualayl safe.


> Adding a file lease interface to block 3rd party changes to the
> file layout until the app releases the lease is a safe way
> of allowing userspace apps to use FIEMAP to map and identify
> file extents they can write directly to if they need to.
> 
> IOWs, we need to get the FL_LAYOUT flag out into the external file
> lease interface (IIRC Dan Williams posted patches for this a while
> back) and get these "FIBMAP + write()" apps to use "FL_LAYOUT,
> fsync(), FIEMAP, write(), ~FL_LAYOUT".
> 
> We need to make FIBMAP go away by providing a safer, more robust
> solution to the problem people are trying to solve.

Sure.  I get it that it's not a great interface.  I get it that there
are better ways.  When those methods are available, we should explicitly
deprecate FIBMAP.

But until then I can't understand why we'd intentionally break an
otherwise reasonably functional interface in subtle and undetectable
ways for certain classes of files.  We /could/ FIBMAP a reflinked file
exactly as well as we can FIBMAP a non-reflinked file, but we choose
not to.  This choice creates unnecessary problems for existing apps.

Until we deprecate the FIBMAP interface, until there is a better way,
I think we should make it as predictable and complete as
we can, not cripple it intentionally.

But I'm clearly in the minority with that opinion, so I guess I'll
withdraw the patch.

-Eric
Dave Chinner Aug. 31, 2018, 3:05 a.m. UTC | #13
On Thu, Aug 30, 2018 at 08:34:02PM -0500, Eric Sandeen wrote:
> On 8/30/18 7:11 PM, Dave Chinner wrote:
> > On Thu, Aug 30, 2018 at 01:51:56PM -0500, Eric Sandeen wrote:
> >> On 8/30/18 1:28 PM, Darrick J. Wong wrote:
> >>> On Thu, Aug 30, 2018 at 02:02:05PM -0400, Brian Foster wrote:
> >>>> On Thu, Aug 30, 2018 at 11:35:46AM -0500, Eric Sandeen wrote:
> >>>>> On 8/30/18 11:36 AM, Christoph Hellwig wrote:
> >>>>>> On Thu, Aug 30, 2018 at 11:31:40AM -0500, Eric Sandeen wrote:
> >>>>>>> That's no reason to uniquely disallow it for reflinked files, though;
> >>>>>>> the problem is universal.  It's true for fiemap as well.  So I'm not sure
> >>>>>>> that's an argument against the patch?
> >>>>>>
> >>>>>> fiemap at least tells you an extent is shared, bmap does not.
> >>>>>
> >>>>> yes, so bmap is clearly the wrong interface to use if you want to
> >>>>> write directly to a file's blocks.  But if you know enough to check
> >>>>> the fiemap shared flag, you know enough to not use fibmap for that purpose...
> >>>>>
> >>>>
> >>>> FWIW, this patch seems reasonable to me. To Christoph's point, I don't
> >>>> think either interface really grants license to write to the underlying
> >>>> blocks, so either way it's technically being abused for this purpose.
> >>>> Unless there's a clear way to return an error for a particular type of
> >>>> file, I think it's reasonable behavior for fibmap to expose the data it
> >>>> supports (i.e., block maps) and drop the data it doesn't (reflink
> >>>> state).
> >>>
> >>> But shared block status isn't something that can be dropped lightly.  If
> >>> you write to a shared block without realizing it, you'll corrupt every
> >>> other file that shares the block.
> >>
> >> But there is no circumstance under which it is safe to write to a mapped
> >> block no matter how you mapped it, tbh.
> > 
> > <sigh>
> > 
> > That's what all the break_layouts() code in XFS provides. It's a
> > mechanism for applications to prevent the block layout from changing
> > unexpected until they - the layout lease owner - give up their
> > exclusive access to the file layout.
> 
> > Seriously, this has been talked about so much in the past year or
> > two in the context of DAX, RDMA, get_user_pages() races in direct
> > IO, etc. it pains me to see this discussion rehashing it all over
> > again.
> > 
> > We want applications to do what they need to do safely.  FIBMAP is
> > unsafe and, worse, it's unfixable. We need to get apps to move away
> > from it to something is actualayl safe.
> 
> 
> > Adding a file lease interface to block 3rd party changes to the
> > file layout until the app releases the lease is a safe way
> > of allowing userspace apps to use FIEMAP to map and identify
> > file extents they can write directly to if they need to.
> > 
> > IOWs, we need to get the FL_LAYOUT flag out into the external file
> > lease interface (IIRC Dan Williams posted patches for this a while
> > back) and get these "FIBMAP + write()" apps to use "FL_LAYOUT,
> > fsync(), FIEMAP, write(), ~FL_LAYOUT".
> > 
> > We need to make FIBMAP go away by providing a safer, more robust
> > solution to the problem people are trying to solve.
> 
> Sure.  I get it that it's not a great interface.  I get it that there
> are better ways.  When those methods are available, we should explicitly
> deprecate FIBMAP.
> 
> But until then I can't understand why we'd intentionally break an
> otherwise reasonably functional interface in subtle and undetectable
> ways for certain classes of files.

Because that's better than subtle, silent and undetectable data
corruption of certain classes of files in certain common FIBMAP use
cases.

> We /could/ FIBMAP a reflinked file
> exactly as well as we can FIBMAP a non-reflinked file, but we choose
> not to.  This choice creates unnecessary problems for existing apps.

Yes, we /could/. But we don't, because because of the typical use
case of FIBMAP which is to map files so that IO (read and/or write)
can be done directly to the block device without going through the
filesystem. For normal files this is /relatively/ safe if the
correct precautions are taken. For a shared extent, writing is
*never* safe.

We don't know what users of FIBMAP are going to do with the block
map they get, but we *know* there are apps that use it to write
direct to block devices and we *know* that this will cause _silent_
data corruption and/or exposure of privileged information (shared
data can have different owners at different permission levels) when
it occurs.

That's what we're stopping by not exposing shared extents via FIBMAP
- applications that are careful and try to do everything as safely
and correctly as possible will still get it wrong and have no
indication that they've just screwed something up badly.

> Until we deprecate the FIBMAP interface, until there is a better way,
> I think we should make it as predictable and complete as
> we can, not cripple it intentionally.

Crippling it is the lesser of two evils, unfortunately. I don't like
it either, but I like the idea of silent data corruption or data
exposure vectors even less.

Cheers,

Dave.
Christoph Hellwig Aug. 31, 2018, 6:28 a.m. UTC | #14
On Thu, Aug 30, 2018 at 11:28:49AM -0700, Darrick J. Wong wrote:
> I prefer to have FIBMAP return errors to *cough* encourage people to use
> FIEMAP.  If code are going to abuse the FI[BE]MAP interface they could
> at least abuse the one that gives it enough context to avoid fs
> corruption.  (A proper fs driver would be preferable, though very
> difficult).

I think Carlos was looking into implementing the FIBMAP ioctl
using ->fiemap.  In that case we could return sensible errors,
and centralize policy in a single place..

> Granted, grub's blocklist code doesn't seem to check for shared blocks
> when it writes grubenv.... yuck, though TBH I don't have the eye budget
> to spend on digging through grub2.  Frankly I think FIBMAP comes verrry
> close to "this API is unfixably stupid and shouldn't be enabled for new
> use cases and should go away some day".

.. and that policy should be: always return an error for the slightest
unusual file layout (shared, encrypted, inline, etc).
Brian Foster Aug. 31, 2018, 12:36 p.m. UTC | #15
On Fri, Aug 31, 2018 at 08:28:13AM +0200, Christoph Hellwig wrote:
> On Thu, Aug 30, 2018 at 11:28:49AM -0700, Darrick J. Wong wrote:
> > I prefer to have FIBMAP return errors to *cough* encourage people to use
> > FIEMAP.  If code are going to abuse the FI[BE]MAP interface they could
> > at least abuse the one that gives it enough context to avoid fs
> > corruption.  (A proper fs driver would be preferable, though very
> > difficult).
> 
> I think Carlos was looking into implementing the FIBMAP ioctl
> using ->fiemap.  In that case we could return sensible errors,
> and centralize policy in a single place..
> 

So basically ioctl_fibmap() either prioritizes ->fiemap() or looks for
some special combination of (fiemap && !bmap) to translate the call..

> > Granted, grub's blocklist code doesn't seem to check for shared blocks
> > when it writes grubenv.... yuck, though TBH I don't have the eye budget
> > to spend on digging through grub2.  Frankly I think FIBMAP comes verrry
> > close to "this API is unfixably stupid and shouldn't be enabled for new
> > use cases and should go away some day".
> 
> .. and that policy should be: always return an error for the slightest
> unusual file layout (shared, encrypted, inline, etc).

... and then return some error if the associate extent is in some state
that cannot be described by fibmap..? That sounds like a nice option to
me. Carlos..?

Maybe it's too late for this, but I think even dropping ->bmap
completely for the time being on XFS reflink=1 filesystems is preferable
to the current behavior where we return a perfectly valid result and
pretend that somehow represents an error to userspace.

The arguments for the current behavior essentially apply the "known
fibmap usecase of direct block writes" as justification for implementing
this policy in the kernel. In practice, the current behavior just trades
off one problem (data corruption) for another where the end result is
probably the same for that particular use case: the system doesn't boot.
If we dropped bmap, then at least there's an obvious error and the user
can decide whether to update to fiemap or disable reflink (as opposed to
us having to continue to chase down these odd bootloader issues).

Brian
Eric Sandeen Aug. 31, 2018, 1:08 p.m. UTC | #16
On 8/30/18 10:05 PM, Dave Chinner wrote:
> On Thu, Aug 30, 2018 at 08:34:02PM -0500, Eric Sandeen wrote:

...

>> We /could/ FIBMAP a reflinked file
>> exactly as well as we can FIBMAP a non-reflinked file, but we choose
>> not to.  This choice creates unnecessary problems for existing apps.
> 
> Yes, we /could/. But we don't, because because of the typical use
> case of FIBMAP which is to map files so that IO (read and/or write)
> can be done directly to the block device without going through the
> filesystem. For normal files this is /relatively/ safe if the
> correct precautions are taken. For a shared extent, writing is
> *never* safe.
> 
> We don't know what users of FIBMAP are going to do with the block
> map they get, but we *know* there are apps that use it to write
> direct to block devices and we *know* that this will cause _silent_
> data corruption and/or exposure of privileged information (shared
> data can have different owners at different permission levels) when
> it occurs.

It'd be way better for my blood pressure if people would stop making
poor arguments about orthogonal issues.

If you can directly read the block device hosting the files, file
permissions of shared blocks are completely beside the point.

And app writers can make bad/dangerous decisions no matter what mapping
mechanisms we provide.

(If an app writer knows enough to look at FIEMAP flags to check for shared
blocks before writing, they'd know enough to use it instead of FIBMAP in
the first place.)

But I'm not even aware of any apps that actually /do/ use FI[BE]MAP mapping
information to write, in any case.

While grub2 does directly write file blocks in one case, it doesn't
even /use/ FIBMAP!  So it will happily continue doing so, even for
reflinked files, despite our best efforts to break the interface for
safe users.


The "typical use case" is to map a file for a bootloader to /read/
at boot, and our arbitrary restriction has broken systems in
the real world, and burned a lot of institutional effort getting
to the bottom of the problem and working around it.

I know it's our job to be pedantic, secure, and correct, but speculating
about what misbehaving privileged apps /could/ do and then plugging a
small fraction of the possible misuse vectors while breaking actual, safe,
real-world usecases just isn't a compelling story.

-Eric
Christoph Hellwig Sept. 1, 2018, 8:31 a.m. UTC | #17
[adding Carlos who was looking into this if I remember correctly]

On Fri, Aug 31, 2018 at 08:36:40AM -0400, Brian Foster wrote:
> So basically ioctl_fibmap() either prioritizes ->fiemap() or looks for
> some special combination of (fiemap && !bmap) to translate the call..

I'd always use fiemap if present, but if there is any good reason
to prefer ->bmap if present that should be ok.  Hopefully we can just
kill of pure ->bmap implementations in the long run..

> > > Granted, grub's blocklist code doesn't seem to check for shared blocks
> > > when it writes grubenv.... yuck, though TBH I don't have the eye budget
> > > to spend on digging through grub2.  Frankly I think FIBMAP comes verrry
> > > close to "this API is unfixably stupid and shouldn't be enabled for new
> > > use cases and should go away some day".
> > 
> > .. and that policy should be: always return an error for the slightest
> > unusual file layout (shared, encrypted, inline, etc).
> 
> ... and then return some error if the associate extent is in some state
> that cannot be described by fibmap..? That sounds like a nice option to
> me. Carlos..?

Yes.

> Maybe it's too late for this, but I think even dropping ->bmap
> completely for the time being on XFS reflink=1 filesystems is preferable
> to the current behavior where we return a perfectly valid result and
> pretend that somehow represents an error to userspace.

Note that a 0 return on ->bmap has traditionally been treated as an
error, including in userspace as block 0 is usually not a valid block
for user data.   (this isn't intended as support for this interface,
just stating history)
Christoph Hellwig Sept. 1, 2018, 8:32 a.m. UTC | #18
On Fri, Aug 31, 2018 at 08:08:11AM -0500, Eric Sandeen wrote:
> If you can directly read the block device hosting the files, file
> permissions of shared blocks are completely beside the point.

The point is that you should never, ever directly read blocks from
the block device, nevermind write.  And we need to make sure userspace
stops doing that instead of catering to it in any way.

> The "typical use case" is to map a file for a bootloader to /read/
> at boot, and our arbitrary restriction has broken systems in
> the real world, and burned a lot of institutional effort getting
> to the bottom of the problem and working around it.

And even that 'use case' is utterly broken.
Carlos Maiolino Sept. 2, 2018, 2:08 p.m. UTC | #19
Hi Folks,

On Fri, Aug 31, 2018 at 08:36:40AM -0400, Brian Foster wrote:
> On Fri, Aug 31, 2018 at 08:28:13AM +0200, Christoph Hellwig wrote:
> > On Thu, Aug 30, 2018 at 11:28:49AM -0700, Darrick J. Wong wrote:
> > > I prefer to have FIBMAP return errors to *cough* encourage people to use
> > > FIEMAP.  If code are going to abuse the FI[BE]MAP interface they could
> > > at least abuse the one that gives it enough context to avoid fs
> > > corruption.  (A proper fs driver would be preferable, though very
> > > difficult).
> > 
> > I think Carlos was looking into implementing the FIBMAP ioctl
> > using ->fiemap.  In that case we could return sensible errors,
> > and centralize policy in a single place..
> > 
> 
> So basically ioctl_fibmap() either prioritizes ->fiemap() or looks for
> some special combination of (fiemap && !bmap) to translate the call..
> 
> > > Granted, grub's blocklist code doesn't seem to check for shared blocks
> > > when it writes grubenv.... yuck, though TBH I don't have the eye budget
> > > to spend on digging through grub2.  Frankly I think FIBMAP comes verrry
> > > close to "this API is unfixably stupid and shouldn't be enabled for new
> > > use cases and should go away some day".
> > 
> > .. and that policy should be: always return an error for the slightest
> > unusual file layout (shared, encrypted, inline, etc).
> 
> ... and then return some error if the associate extent is in some state
> that cannot be described by fibmap..? That sounds like a nice option to
> me. Carlos..?
> 

Yes, I've been working on using FIEMAP interface to handle FIBMAP, it was mostly
working, although it needed some extra tweaks due the fact different
filesystems return different blocks inside an extent, when a single block query
is made on FIEMAP.

I mean, if you query for a single block which is in the middle of an extent,
ext4 returns the address of the specific block inside the extent, while xfs
(using iomap fiemap infra), returns the address of the first block in the
extent.
Or something like that, I needed to context switch to some other tasks, but I'll
come back to this during this week, and let you guys know.

Cheers

> Maybe it's too late for this, but I think even dropping ->bmap
> completely for the time being on XFS reflink=1 filesystems is preferable
> to the current behavior where we return a perfectly valid result and
> pretend that somehow represents an error to userspace.
> 
> The arguments for the current behavior essentially apply the "known
> fibmap usecase of direct block writes" as justification for implementing
> this policy in the kernel. In practice, the current behavior just trades
> off one problem (data corruption) for another where the end result is
> probably the same for that particular use case: the system doesn't boot.
> If we dropped bmap, then at least there's an obvious error and the user
> can decide whether to update to fiemap or disable reflink (as opposed to
> us having to continue to chase down these odd bootloader issues).
> 
> Brian
Eric Sandeen Sept. 2, 2018, 5:52 p.m. UTC | #20
On 9/2/18 9:08 AM, Carlos Maiolino wrote:
> Hi Folks,
> 
> On Fri, Aug 31, 2018 at 08:36:40AM -0400, Brian Foster wrote:
>> On Fri, Aug 31, 2018 at 08:28:13AM +0200, Christoph Hellwig wrote:
>>> On Thu, Aug 30, 2018 at 11:28:49AM -0700, Darrick J. Wong wrote:
>>>> I prefer to have FIBMAP return errors to *cough* encourage people to use
>>>> FIEMAP.  If code are going to abuse the FI[BE]MAP interface they could
>>>> at least abuse the one that gives it enough context to avoid fs
>>>> corruption.  (A proper fs driver would be preferable, though very
>>>> difficult).
>>>
>>> I think Carlos was looking into implementing the FIBMAP ioctl
>>> using ->fiemap.  In that case we could return sensible errors,
>>> and centralize policy in a single place..
>>>
>>
>> So basically ioctl_fibmap() either prioritizes ->fiemap() or looks for
>> some special combination of (fiemap && !bmap) to translate the call..
>>
>>>> Granted, grub's blocklist code doesn't seem to check for shared blocks
>>>> when it writes grubenv.... yuck, though TBH I don't have the eye budget
>>>> to spend on digging through grub2.  Frankly I think FIBMAP comes verrry
>>>> close to "this API is unfixably stupid and shouldn't be enabled for new
>>>> use cases and should go away some day".
>>>
>>> .. and that policy should be: always return an error for the slightest
>>> unusual file layout (shared, encrypted, inline, etc).
>>
>> ... and then return some error if the associate extent is in some state
>> that cannot be described by fibmap..? That sounds like a nice option to
>> me. Carlos..?
>>
> 
> Yes, I've been working on using FIEMAP interface to handle FIBMAP, it was mostly
> working, although it needed some extra tweaks due the fact different
> filesystems return different blocks inside an extent, when a single block query
> is made on FIEMAP.
> 
> I mean, if you query for a single block which is in the middle of an extent,
> ext4 returns the address of the specific block inside the extent, while xfs
> (using iomap fiemap infra), returns the address of the first block in the
> extent.

IMHO, with hindsight, FIEMAP really kind of sucks.  The call is a pain to
set up, and the results are a pain to interpret.

Preparing a patch for zipl to use FIEMAP, I realized what a truly crappy and
cumbersome interface it is, particularly if you just want to map a small
handful of blocks.

I doubt it'd fly, but I'm half tempted to propose a new block-at-at-time
FIBMAP2 that doesn't overflow 32 bits, uses flags similar to FIEMAP to control
behavior and return mapped block state, and can return proper errors.

FIEMAP is great that it can efficiently map contiguous extents and all, but
it's so cumbersome to use.  (ISTR that both xfsprogs' and e2fsprogs' use of
FIEMAP had /multiple/ followon commits fixing the original implementation
in filefrag and xfs_io.)

-Eric
Carlos Maiolino Sept. 3, 2018, 10:21 a.m. UTC | #21
On Sun, Sep 02, 2018 at 12:52:47PM -0500, Eric Sandeen wrote:
> On 9/2/18 9:08 AM, Carlos Maiolino wrote:
> > Hi Folks,
> > 
> > On Fri, Aug 31, 2018 at 08:36:40AM -0400, Brian Foster wrote:
> >> On Fri, Aug 31, 2018 at 08:28:13AM +0200, Christoph Hellwig wrote:
> >>> On Thu, Aug 30, 2018 at 11:28:49AM -0700, Darrick J. Wong wrote:
> >>>> I prefer to have FIBMAP return errors to *cough* encourage people to use
> >>>> FIEMAP.  If code are going to abuse the FI[BE]MAP interface they could
> >>>> at least abuse the one that gives it enough context to avoid fs
> >>>> corruption.  (A proper fs driver would be preferable, though very
> >>>> difficult).
> >>>
> >>> I think Carlos was looking into implementing the FIBMAP ioctl
> >>> using ->fiemap.  In that case we could return sensible errors,
> >>> and centralize policy in a single place..
> >>>
> >>
> >> So basically ioctl_fibmap() either prioritizes ->fiemap() or looks for
> >> some special combination of (fiemap && !bmap) to translate the call..
> >>
> >>>> Granted, grub's blocklist code doesn't seem to check for shared blocks
> >>>> when it writes grubenv.... yuck, though TBH I don't have the eye budget
> >>>> to spend on digging through grub2.  Frankly I think FIBMAP comes verrry
> >>>> close to "this API is unfixably stupid and shouldn't be enabled for new
> >>>> use cases and should go away some day".
> >>>
> >>> .. and that policy should be: always return an error for the slightest
> >>> unusual file layout (shared, encrypted, inline, etc).
> >>
> >> ... and then return some error if the associate extent is in some state
> >> that cannot be described by fibmap..? That sounds like a nice option to
> >> me. Carlos..?
> >>
> > 
> > Yes, I've been working on using FIEMAP interface to handle FIBMAP, it was mostly
> > working, although it needed some extra tweaks due the fact different
> > filesystems return different blocks inside an extent, when a single block query
> > is made on FIEMAP.
> > 
> > I mean, if you query for a single block which is in the middle of an extent,
> > ext4 returns the address of the specific block inside the extent, while xfs
> > (using iomap fiemap infra), returns the address of the first block in the
> > extent.

Ops, after re-reading my notes about it today, it's quite the opposite. xfs
(with iomap), returns the requested block, while ext4 returns the beginning of
the extent.
> 
> IMHO, with hindsight, FIEMAP really kind of sucks.  The call is a pain to
> set up, and the results are a pain to interpret.
> 
> Preparing a patch for zipl to use FIEMAP, I realized what a truly crappy and
> cumbersome interface it is, particularly if you just want to map a small
> handful of blocks.
> 
> I doubt it'd fly, but I'm half tempted to propose a new block-at-at-time
> FIBMAP2 that doesn't overflow 32 bits, uses flags similar to FIEMAP to control
> behavior and return mapped block state, and can return proper errors.
> 
> FIEMAP is great that it can efficiently map contiguous extents and all, but
> it's so cumbersome to use.  (ISTR that both xfsprogs' and e2fsprogs' use of
> FIEMAP had /multiple/ followon commits fixing the original implementation
> in filefrag and xfs_io.)
> 

I'm still refreshing my memory at the subject, but, IIRC the past discussions,
we will need to support FIBMAP forever, but, internally we can actually use
FIEMAP infra-structure to answer FIBMAP calls. From a user perspective, it
changes nothing.

The idea, is to replace all the ->bmap calls, by calling into bmap directly, and
make bmap() internals to use FIEMAP infra-structure to get the required single
block.

I made it work successfully on systems using only XFS, until I tested on ext4
and the system didn't boot due the difference between iomap and ext4 FIEMAP
implementation I explained above.

Actually, IIRC I put this project by side because we haven't decided which
interface would be better, but looks like the right approach would be to use an
fs-independent solution, which is something I'll try to achieve this week.

Cheers

> -Eric
diff mbox series

Patch

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 8eb3ba3d4d00..b75a7f37bdd9 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1382,15 +1382,10 @@  xfs_vm_bmap(
 	trace_xfs_vm_bmap(ip);
 
 	/*
-	 * The swap code (ab-)uses ->bmap to get a block mapping and then
-	 * bypasses the file system for actual I/O.  We really can't allow
-	 * that on reflinks inodes, so we have to skip out here.  And yes,
-	 * 0 is the magic code for a bmap error.
-	 *
 	 * Since we don't pass back blockdev info, we can't return bmap
-	 * information for rt files either.
+	 * information for rt files.
 	 */
-	if (xfs_is_reflink_inode(ip) || XFS_IS_REALTIME_INODE(ip))
+	if (XFS_IS_REALTIME_INODE(ip))
 		return 0;
 	return iomap_bmap(mapping, block, &xfs_iomap_ops);
 }
@@ -1477,7 +1472,13 @@  xfs_iomap_swapfile_activate(
 	struct file			*swap_file,
 	sector_t			*span)
 {
-	sis->bdev = xfs_find_bdev_for_inode(file_inode(swap_file));
+	struct inode			*inode = file_inode(swap_file);
+
+	/* Cannot allow swap to write directly to reflinked files/blocks */
+	if (xfs_is_reflink_inode(XFS_I(inode)))
+		return -EOPNOTSUPP;
+
+	sis->bdev = xfs_find_bdev_for_inode(inode);
 	return iomap_swapfile_activate(sis, swap_file, span, &xfs_iomap_ops);
 }