
[RFC,0/2] fuse: introduce fuse server recovery mechanism

Message ID 20240524064030.4944-1-jefflexu@linux.alibaba.com

Message

Jingbo Xu May 24, 2024, 6:40 a.m. UTC
Background
==========
The fd of '/dev/fuse' serves as a message transmission channel between
the FUSE filesystem (kernel space) and the fuse server (user space).
Once the fd gets closed (intentionally or unintentionally), the FUSE
filesystem gets aborted, and any attempt at filesystem access gets an
-ECONNABORTED error until the FUSE filesystem is finally unmounted.

Providing uninterruptible filesystem service is one of the requisites
in a production environment.  The most straightforward way, and maybe
the most widely used way, is to make another dedicated user daemon
(similar to the systemd fdstore) keep the device fd open.  When the
fuse daemon recovers from a crash, it can retrieve the device fd from
the fdstore daemon through the socket takeover (Unix domain socket)
method [1] or the pidfd_getfd() syscall [2].  In this way, as long as
the fdstore daemon doesn't exit, the FUSE filesystem won't get aborted
when the fuse daemon crashes, though the filesystem service may hang
for a while until the restarted fuse daemon has completely recovered.
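
For illustration only, a minimal sketch of the pidfd_getfd() based
takeover could look like the following (the fdstore daemon's pid and
the number of the /dev/fuse fd it keeps open are assumed to be known
through some out-of-band channel, e.g. a Unix socket):

/* Sketch only: retrieve the kept-open /dev/fuse fd from the fdstore
 * daemon via pidfd_getfd(2). */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static int retrieve_fuse_fd(pid_t fdstore_pid, int remote_fd)
{
	int pidfd, fd;

	pidfd = syscall(SYS_pidfd_open, fdstore_pid, 0);
	if (pidfd < 0)
		return -1;

	/* Requires ptrace-level permission (PTRACE_MODE_ATTACH_REALCREDS)
	 * over the fdstore daemon. */
	fd = syscall(SYS_pidfd_getfd, pidfd, remote_fd, 0);
	close(pidfd);
	return fd;
}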

This picture indeed works and has been deployed in our internal
production environment, until the following issues were encountered:

1. The fdstore daemon may be killed by mistake, in which case the FUSE
filesystem gets aborted and becomes irrecoverable.

2. In scenarios of containerized deployment, the fuse daemon is deployed
in a container pod, and a dedicated fdstore daemon needs to be deployed
for each fuse daemon.  The fdstore daemon consumes a certain amount of
resources (e.g. memory footprint), which is not conducive to dense
container deployment.

3. Each fuse daemon implementation needs to implement its own fdstore
daemon.  If we implement the fuse recovery mechanism on the kernel side,
all fuse daemon implementations could reuse this mechanism.


What we do
==========

Basic Recovery Mechanism
------------------------
We introduce a recovery mechanism for the fuse server on the kernel side.

To do this:
1. Introduce a new "tag=" mount option, with which users could identify
a fuse connection with a unique name.
2. Introduce a new FUSE_DEV_IOC_ATTACH ioctl, with which the fuse server
could reconnect to the fuse connection corresponding to the given tag.
3. Introduce a new FUSE_HAS_RECOVERY init flag.  The fuse server should
advertise this feature if it supports server recovery.


With the above recovery mechanism, the whole sequence is as follows:
- At the initial mount, the fuse filesystem is mounted with "tag="
  option
- The fuse server advertises FUSE_HAS_RECOVERY flag when replying
  FUSE_INIT
- When the fuse server crashes and the (/dev/fuse) device fd is closed,
  the fuse connection won't be aborted.
- The requests submitted after the server crash will remain in the
  iqueue; the processes submitting these requests will hang there
- The fuse server gets restarted and recovers the previous state before
  crash (including the negotiation results of the last FUSE_INIT)
- The fuse server opens /dev/fuse and gets a new device fd, and then
  runs FUSE_DEV_IOC_ATTACH ioctl on the new device fd to retrieve the
  fuse connection with the tag previously used to mount the fuse
  filesystem
- The fuse server issues a FUSE_NOTIFY_RESEND notification to request
  that the kernel resend the inflight requests that had been sent to
  the fuse server before the crash but not yet been replied to
- The fuse server starts to process requests normally (those queued in
  iqueue and those resent by FUSE_NOTIFY_RESEND)

In summary, the requests submitted after the server crash will stay in
the iqueue and get serviced once the fuse server recovers from the crash
and retrieves the previous fuse connection.  As for the inflight requests
that had been sent to the fuse server before the crash but not yet been
replied to, the fuse server can request the kernel to resend them through
the FUSE_NOTIFY_RESEND notification type.
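
For illustration only, the server-side recovery path under the proposed
interface could look roughly like the sketch below.  The
FUSE_DEV_IOC_ATTACH request number and its argument layout are
assumptions made for this sketch (the real definitions come from the
patches); FUSE_NOTIFY_RESEND is the notify code from <linux/fuse.h>
(v6.9+).

/* Rough sketch of the proposed recovery flow; not the actual UAPI. */
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fuse.h>

#ifndef FUSE_DEV_IOC_ATTACH
/* Hypothetical request number for illustration only. */
#define FUSE_DEV_IOC_ATTACH	_IOW(FUSE_DEV_IOC_MAGIC, 1, char[32])
#endif

static int fuse_server_reattach(const char *tag)
{
	struct fuse_out_header out;
	int fd;

	fd = open("/dev/fuse", O_RDWR | O_CLOEXEC);
	if (fd < 0)
		return -1;

	/* Re-attach the new device fd to the surviving connection by tag. */
	if (ioctl(fd, FUSE_DEV_IOC_ATTACH, tag) < 0)
		goto err;

	/* Ask the kernel to resend requests that were already dispatched
	 * to the crashed server instance; unique == 0 marks a notification. */
	memset(&out, 0, sizeof(out));
	out.len = sizeof(out);
	out.error = FUSE_NOTIFY_RESEND;
	if (write(fd, &out, sizeof(out)) != sizeof(out))
		goto err;

	return fd;	/* resume the normal read/process/reply loop */
err:
	close(fd);
	return -1;
}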


Security Enhancement
---------------------
Besides, we offer a uid-based security enhancement for the fuse server
recovery mechanism.  Otherwise, any malicious attacker could kill the
fuse server and then take over the filesystem service through the
recovery mechanism.

To implement this, we introduce a new "rescue_uid=" mount option
specifying the expected uid of the legitimate process running the fuse
server.  Then only a process with the matching uid is permitted to
retrieve the fuse connection through the server recovery mechanism.
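
As an illustrative example ("tag=" and "rescue_uid=" are the options
proposed by this RFC; everything else is the regular fuse mount
procedure), the initial mount could be set up roughly as follows:

/* Sketch only: mount a fuse filesystem with the proposed "tag=" and
 * "rescue_uid=" options; fd/rootmode/user_id/group_id are the usual
 * mandatory fuse mount options. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
	char opts[128];
	int fd;

	fd = open("/dev/fuse", O_RDWR);
	if (fd < 0)
		return 1;

	snprintf(opts, sizeof(opts),
		 "fd=%d,rootmode=40000,user_id=0,group_id=0,"
		 "tag=myfs,rescue_uid=1000", fd);

	if (mount("myfs", "/mnt/fuse", "fuse", 0, opts) < 0) {
		perror("mount");
		return 1;
	}
	/* ... hand the fd over to the fuse server request loop ... */
	return 0;
}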


Limitation
==========
1. The current mechanism won't resend a new FUSE_INIT request to the fuse
server or start a new negotiation when the fuse server attempts to
re-attach to the fuse connection through the FUSE_DEV_IOC_ATTACH ioctl.
Thus the fuse server needs to recover the previous state before the crash
(including the negotiation results of the last FUSE_INIT) by itself.

PS. Thus I had to apply some hacks to the libfuse passthrough_ll daemon
when testing the recovery feature.

2. With the current recovery mechanism, the fuse filesystem won't get
aborted when the fuse server crashes.  A subsequent umount will hang.
The call stack shows the hung task is waiting for FUSE_GETATTR on the
mountpoint:

[<0>] request_wait_answer+0xe1/0x200
[<0>] fuse_simple_request+0x18e/0x2a0
[<0>] fuse_do_getattr+0xc9/0x180
[<0>] vfs_statx+0x92/0x170
[<0>] vfs_fstatat+0x7c/0xb0
[<0>] __do_sys_newstat+0x1d/0x40
[<0>] do_syscall_64+0x60/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

It's not fixed yet in this RFC version.

3. I don't know if a kernel based recovery mechanism is welcome on the
community side.  Any comment is welcome.  Thanks!


[1] https://copyconstruct.medium.com/file-descriptor-transfer-over-unix-domain-sockets-dcbbf5b3b6ec
[2] https://copyconstruct.medium.com/seamless-file-descriptor-transfer-between-processes-with-pidfd-and-pidfd-getfd-816afcd19ed4


Jingbo Xu (2):
  fuse: introduce recovery mechanism for fuse server
  fuse: uid-based security enhancement for the recovery mechanism

 fs/fuse/dev.c             | 55 ++++++++++++++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h          | 15 +++++++++++
 fs/fuse/inode.c           | 46 +++++++++++++++++++++++++++++++-
 include/uapi/linux/fuse.h |  7 +++++
 4 files changed, 121 insertions(+), 2 deletions(-)

Comments

Miklos Szeredi May 27, 2024, 3:16 p.m. UTC | #1
On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> 3. I don't know if a kernel based recovery mechanism is welcome on the
> community side.  Any comment is welcome.  Thanks!

I'd prefer something external to fuse.

Maybe a kernel based fdstore (lifetime connected to that of the
container) would be a useful service more generally?

Thanks,
Miklos
Jingbo Xu May 28, 2024, 2:45 a.m. UTC | #2
On 5/27/24 11:16 PM, Miklos Szeredi wrote:
> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> 
>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>> community side.  Any comment is welcome.  Thanks!
> 
> I'd prefer something external to fuse.

Okay, understood.

> 
> Maybe a kernel based fdstore (lifetime connected to that of the
> container) would a useful service more generally?

Yeah I indeed had considered this, but I'm afraid VFS guys would be
concerned about why we do this on kernel side rather than in user space.

I'm not sure what the VFS guys think about this and if the kernel side
shall care about this.

Many thanks!
Jingbo Xu May 28, 2024, 3:08 a.m. UTC | #3
On 5/28/24 10:45 AM, Jingbo Xu wrote:
> 
> 
> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>> community side.  Any comment is welcome.  Thanks!
>>
>> I'd prefer something external to fuse.
> 
> Okay, understood.
> 
>>
>> Maybe a kernel based fdstore (lifetime connected to that of the
>> container) would a useful service more generally?
> 
> Yeah I indeed had considered this, but I'm afraid VFS guys would be
> concerned about why we do this on kernel side rather than in user space.
> 
> I'm not sure what the VFS guys think about this and if the kernel side
> shall care about this.
> 

There was an RFC for kernel-side fdstore [1], though it's also
implemented upon FUSE.

[1]
https://lore.kernel.org/all/CA+a=Yy5rnqLqH2iR-ZY6AUkNJy48mroVV3Exmhmt-pfTi82kXA@mail.gmail.com/T/
Gao Xiang May 28, 2024, 4:02 a.m. UTC | #4
On 2024/5/28 11:08, Jingbo Xu wrote:
> 
> 
> On 5/28/24 10:45 AM, Jingbo Xu wrote:
>>
>>
>> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>
>>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>>> community side.  Any comment is welcome.  Thanks!
>>>
>>> I'd prefer something external to fuse.
>>
>> Okay, understood.
>>
>>>
>>> Maybe a kernel based fdstore (lifetime connected to that of the
>>> container) would a useful service more generally?
>>
>> Yeah I indeed had considered this, but I'm afraid VFS guys would be
>> concerned about why we do this on kernel side rather than in user space.

Just from my own perspective, even if it's in FUSE, the concern is
almost the same.

I wonder if on-demand cachefiles can keep fds too in the future
(thus e.g. daemonless feature could even be implemented entirely
with kernel fdstore) but it still has the same concern or it's
a source of duplication.

Thanks,
Gao Xiang

>>
>> I'm not sure what the VFS guys think about this and if the kernel side
>> shall care about this.
>>
> 
> There was an RFC for kernel-side fdstore [1], though it's also
> implemented upon FUSE.
> 
> [1]
> https://lore.kernel.org/all/CA+a=Yy5rnqLqH2iR-ZY6AUkNJy48mroVV3Exmhmt-pfTi82kXA@mail.gmail.com/T/
> 
> 
>
Miklos Szeredi May 28, 2024, 7:45 a.m. UTC | #5
On Tue, 28 May 2024 at 04:45, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> Yeah I indeed had considered this, but I'm afraid VFS guys would be
> concerned about why we do this on kernel side rather than in user space.
>
> I'm not sure what the VFS guys think about this and if the kernel side
> shall care about this.

Yes, that is indeed something that needs to be discussed.

I often find that, when discussing something like this, a lot of good
ideas come from different directions, so it can help move things
forward.

Try something really simple first, and post a patch.  Don't overthink
the first version.

Thanks,
Miklos
Miklos Szeredi May 28, 2024, 7:46 a.m. UTC | #6
On Tue, 28 May 2024 at 05:08, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> There was an RFC for kernel-side fdstore [1], though it's also
> implemented upon FUSE.

I strongly believe that this needs to be disassociated from fuse.

It could be a pseudo filesystem, though.

Thanks,
Miklos
Christian Brauner May 28, 2024, 8:38 a.m. UTC | #7
On Fri, May 24, 2024 at 02:40:28PM +0800, Jingbo Xu wrote:
> Background
> ==========
> The fd of '/dev/fuse' serves as a message transmission channel between
> FUSE filesystem (kernel space) and fuse server (user space). Once the
> fd gets closed (intentionally or unintentionally), the FUSE filesystem
> gets aborted, and any attempt of filesystem access gets -ECONNABORTED
> error until the FUSE filesystem finally umounted.
> 
> It is one of the requisites in production environment to provide
> uninterruptible filesystem service.  The most straightforward way, and
> maybe the most widely used way, is that make another dedicated user
> daemon (similar to systemd fdstore) keep the device fd open.  When the
> fuse daemon recovers from a crash, it can retrieve the device fd from the
> fdstore daemon through socket takeover (Unix domain socket) method [1]
> or pidfd_getfd() syscall [2].  In this way, as long as the fdstore
> daemon doesn't exit, the FUSE filesystem won't get aborted once the fuse
> daemon crashes, though the filesystem service may hang there for a while
> when the fuse daemon gets restarted and has not been completely
> recovered yet.
> 
> This picture indeed works and has been deployed in our internal
> production environment until the following issues are encountered:
> 
> 1. The fdstore daemon may be killed by mistake, in which case the FUSE
> filesystem gets aborted and irrecoverable.

That's only a problem if you use the fdstore of the per-user instance.
The main fdstore is part of PID 1 and you can't kill that. So really,
systemd needs to hand the fds from the per-user instance to the main
fdstore.

> 2. In scenarios of containerized deployment, the fuse daemon is deployed
> in a container POD, and a dedicated fdstore daemon needs to be deployed
> for each fuse daemon.  The fdstore daemon could consume a amount of
> resources (e.g. memory footprint), which is not conducive to the dense
> container deployment.
> 
> 3. Each fuse daemon implementation needs to implement its own fdstore
> daemon.  If we implement the fuse recovery mechanism on the kernel side,
> all fuse daemon implementations could reuse this mechanism.

You can just use the global fdstore. That is a design limitation, not an
inherent limitation.
Christian Brauner May 28, 2024, 8:43 a.m. UTC | #8
On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
> 
> 
> On 2024/5/28 11:08, Jingbo Xu wrote:
> > 
> > 
> > On 5/28/24 10:45 AM, Jingbo Xu wrote:
> > > 
> > > 
> > > On 5/27/24 11:16 PM, Miklos Szeredi wrote:
> > > > On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > > > 
> > > > > 3. I don't know if a kernel based recovery mechanism is welcome on the
> > > > > community side.  Any comment is welcome.  Thanks!
> > > > 
> > > > I'd prefer something external to fuse.
> > > 
> > > Okay, understood.
> > > 
> > > > 
> > > > Maybe a kernel based fdstore (lifetime connected to that of the
> > > > container) would a useful service more generally?
> > > 
> > > Yeah I indeed had considered this, but I'm afraid VFS guys would be
> > > concerned about why we do this on kernel side rather than in user space.
> 
> Just from my own perspective, even if it's in FUSE, the concern is
> almost the same.
> 
> I wonder if on-demand cachefiles can keep fds too in the future
> (thus e.g. daemonless feature could even be implemented entirely
> with kernel fdstore) but it still has the same concern or it's
> a source of duplication.
> 
> Thanks,
> Gao Xiang
> 
> > > 
> > > I'm not sure what the VFS guys think about this and if the kernel side
> > > shall care about this.

Fwiw, I'm not convinced and I think that's a big can of worms security
wise and semantics wise. I have discussed whether a kernel-side fdstore
would be something that systemd would use if available multiple times
and they wouldn't use it because it provides them with no benefits over
having it in userspace.

Especially since it implements a lot of special semantics and policy
that we really don't want in the kernel. I think that's just not
something we should do. We should give userspace all the means to
implement fdstores in userspace but not hold fds ourselves.
Gao Xiang May 28, 2024, 9:13 a.m. UTC | #9
Hi Christian,

On 2024/5/28 16:43, Christian Brauner wrote:
> On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
>>
>>
>> On 2024/5/28 11:08, Jingbo Xu wrote:
>>>
>>>
>>> On 5/28/24 10:45 AM, Jingbo Xu wrote:
>>>>
>>>>
>>>> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>>>>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>
>>>>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>>>>> community side.  Any comment is welcome.  Thanks!
>>>>>
>>>>> I'd prefer something external to fuse.
>>>>
>>>> Okay, understood.
>>>>
>>>>>
>>>>> Maybe a kernel based fdstore (lifetime connected to that of the
>>>>> container) would a useful service more generally?
>>>>
>>>> Yeah I indeed had considered this, but I'm afraid VFS guys would be
>>>> concerned about why we do this on kernel side rather than in user space.
>>
>> Just from my own perspective, even if it's in FUSE, the concern is
>> almost the same.
>>
>> I wonder if on-demand cachefiles can keep fds too in the future
>> (thus e.g. daemonless feature could even be implemented entirely
>> with kernel fdstore) but it still has the same concern or it's
>> a source of duplication.
>>
>> Thanks,
>> Gao Xiang
>>
>>>>
>>>> I'm not sure what the VFS guys think about this and if the kernel side
>>>> shall care about this.
> 
> Fwiw, I'm not convinced and I think that's a big can of worms security
> wise and semantics wise. I have discussed whether a kernel-side fdstore
> would be something that systemd would use if available multiple times
> and they wouldn't use it because it provides them with no benefits over
> having it in userspace.

As far as I know, there are currently roughly two ways to do
failover mechanisms in the kernel.

The first model is fuse-like: in this mode, we should keep and pass
the fd to maintain the active state.  And currently, userspace is
responsible for the permission/security issues when doing something
like passing fds.

The second model is a one device, one instance model, for example
ublk (if I understand correctly): each active instance (/dev/ublkbX)
has its own unique control device (/dev/ublkcX).  Users could
assign/change DAC/MAC for each control device.  And failover
recovery just needs to reopen the control device with the proper
permission and do the recovery.

So, just my own thought: a kernel-side fdstore pseudo filesystem may
provide a DAC/MAC mechanism for the first model.  That would be a much
cleaner way than doing something similar independently in each
subsystem that needs a DAC/MAC-like mechanism.  But that is
just my own thought.

Thanks,
Gao Xiang

> 
> Especially since it implements a lot of special semantics and policy
> that we really don't want in the kernel. I think that's just not
> something we should do. We should give userspace all the means to
> implement fdstores in userspace but not hold fds ourselves.
Christian Brauner May 28, 2024, 9:32 a.m. UTC | #10
On Tue, May 28, 2024 at 05:13:04PM +0800, Gao Xiang wrote:
> Hi Christian,
> 
> On 2024/5/28 16:43, Christian Brauner wrote:
> > On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
> > > 
> > > 
> > > On 2024/5/28 11:08, Jingbo Xu wrote:
> > > > 
> > > > 
> > > > On 5/28/24 10:45 AM, Jingbo Xu wrote:
> > > > > 
> > > > > 
> > > > > On 5/27/24 11:16 PM, Miklos Szeredi wrote:
> > > > > > On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > > > > > 
> > > > > > > 3. I don't know if a kernel based recovery mechanism is welcome on the
> > > > > > > community side.  Any comment is welcome.  Thanks!
> > > > > > 
> > > > > > I'd prefer something external to fuse.
> > > > > 
> > > > > Okay, understood.
> > > > > 
> > > > > > 
> > > > > > Maybe a kernel based fdstore (lifetime connected to that of the
> > > > > > container) would a useful service more generally?
> > > > > 
> > > > > Yeah I indeed had considered this, but I'm afraid VFS guys would be
> > > > > concerned about why we do this on kernel side rather than in user space.
> > > 
> > > Just from my own perspective, even if it's in FUSE, the concern is
> > > almost the same.
> > > 
> > > I wonder if on-demand cachefiles can keep fds too in the future
> > > (thus e.g. daemonless feature could even be implemented entirely
> > > with kernel fdstore) but it still has the same concern or it's
> > > a source of duplication.
> > > 
> > > Thanks,
> > > Gao Xiang
> > > 
> > > > > 
> > > > > I'm not sure what the VFS guys think about this and if the kernel side
> > > > > shall care about this.
> > 
> > Fwiw, I'm not convinced and I think that's a big can of worms security
> > wise and semantics wise. I have discussed whether a kernel-side fdstore
> > would be something that systemd would use if available multiple times
> > and they wouldn't use it because it provides them with no benefits over
> > having it in userspace.
> 
> As far as I know, currently there are approximately two ways to do
> failover mechanisms in kernel.
> 
> The first model much like a fuse-like model: in this mode, we should
> keep and pass fd to maintain the active state.  And currently,
> userspace should be responsible for the permission/security issues
> when doing something like passing fds.
> 
> The second model is like one device-one instance model, for example
> ublk (If I understand correctly): each active instance (/dev/ublkbX)
> has their own unique control device (/dev/ublkcX).  Users could
> assign/change DAC/MAC for each control device.  And failover
> recovery just needs to reopen the control device with proper
> permission and do recovery.
> 
> So just my own thought, kernel-side fdstore pseudo filesystem may
> provide a DAC/MAC mechanism for the first model.  That is a much
> cleaner way than doing some similar thing independently in each
> subsystem which may need DAC/MAC-like mechanism.  But that is
> just my own thought.

The failover mechanism for /dev/ublkcX could easily be implemented using
the fdstore. The fact that they rolled their own thing is orthogonal to
this imho. Implementing retrieval policies like this in the kernel is
slowly advancing into /proc/$pid/fd/ levels of complexity. That's all
better handled with appropriate policies in userspace. And cachefilesd
can similarly just stash their fds in the fdstore.
Jingbo Xu May 28, 2024, 9:45 a.m. UTC | #11
Hi, Christian,

Thanks for the review.


On 5/28/24 4:38 PM, Christian Brauner wrote:
> On Fri, May 24, 2024 at 02:40:28PM +0800, Jingbo Xu wrote:
>> Background
>> ==========
>> The fd of '/dev/fuse' serves as a message transmission channel between
>> FUSE filesystem (kernel space) and fuse server (user space). Once the
>> fd gets closed (intentionally or unintentionally), the FUSE filesystem
>> gets aborted, and any attempt of filesystem access gets -ECONNABORTED
>> error until the FUSE filesystem finally umounted.
>>
>> It is one of the requisites in production environment to provide
>> uninterruptible filesystem service.  The most straightforward way, and
>> maybe the most widely used way, is that make another dedicated user
>> daemon (similar to systemd fdstore) keep the device fd open.  When the
>> fuse daemon recovers from a crash, it can retrieve the device fd from the
>> fdstore daemon through socket takeover (Unix domain socket) method [1]
>> or pidfd_getfd() syscall [2].  In this way, as long as the fdstore
>> daemon doesn't exit, the FUSE filesystem won't get aborted once the fuse
>> daemon crashes, though the filesystem service may hang there for a while
>> when the fuse daemon gets restarted and has not been completely
>> recovered yet.
>>
>> This picture indeed works and has been deployed in our internal
>> production environment until the following issues are encountered:
>>
>> 1. The fdstore daemon may be killed by mistake, in which case the FUSE
>> filesystem gets aborted and irrecoverable.
> 
> That's only a problem if you use the fdstore of the per-user instance.
> The main fdstore is part of PID 1 and you can't kill that. So really,
> systemd needs to hand the fds from the per-user instance to the main
> fdstore.

Systemd indeed has implemented its own fdstore mechanism in user space.

Nowadays more and more fuse daemons are running inside containers, but a
container generally has no systemd inside it.
> 
>> 2. In scenarios of containerized deployment, the fuse daemon is deployed
>> in a container POD, and a dedicated fdstore daemon needs to be deployed
>> for each fuse daemon.  The fdstore daemon could consume a amount of
>> resources (e.g. memory footprint), which is not conducive to the dense
>> container deployment.
>>
>> 3. Each fuse daemon implementation needs to implement its own fdstore
>> daemon.  If we implement the fuse recovery mechanism on the kernel side,
>> all fuse daemon implementations could reuse this mechanism.
> 
> You can just the global fdstore. That is a design limitation not an
> inherent limitation.

What I initially meant is that each fuse daemon implementation (e.g.
s3fs, ossfs, and other vendors) needs to build its own, yet similar,
mechanism for daemon failover.  There has been no common fdstore
component in container scenarios comparable to the systemd fdstore.


I'd admit that it's controversial to implement a kernel-side fdstore.
Thus I only implemented a failover mechanism for the fuse server in this
RFC patch.  But I also understand Miklos's point, since what we really
need to support daemon failover is just something like an fdstore to
keep the device fd alive.
Gao Xiang May 28, 2024, 9:58 a.m. UTC | #12
On 2024/5/28 17:32, Christian Brauner wrote:
> On Tue, May 28, 2024 at 05:13:04PM +0800, Gao Xiang wrote:
>> Hi Christian,
>>
>> On 2024/5/28 16:43, Christian Brauner wrote:
>>> On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2024/5/28 11:08, Jingbo Xu wrote:
>>>>>
>>>>>
>>>>> On 5/28/24 10:45 AM, Jingbo Xu wrote:
>>>>>>
>>>>>>
>>>>>> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>>>>>>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>>
>>>>>>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>>>>>>> community side.  Any comment is welcome.  Thanks!
>>>>>>>
>>>>>>> I'd prefer something external to fuse.
>>>>>>
>>>>>> Okay, understood.
>>>>>>
>>>>>>>
>>>>>>> Maybe a kernel based fdstore (lifetime connected to that of the
>>>>>>> container) would a useful service more generally?
>>>>>>
>>>>>> Yeah I indeed had considered this, but I'm afraid VFS guys would be
>>>>>> concerned about why we do this on kernel side rather than in user space.
>>>>
>>>> Just from my own perspective, even if it's in FUSE, the concern is
>>>> almost the same.
>>>>
>>>> I wonder if on-demand cachefiles can keep fds too in the future
>>>> (thus e.g. daemonless feature could even be implemented entirely
>>>> with kernel fdstore) but it still has the same concern or it's
>>>> a source of duplication.
>>>>
>>>> Thanks,
>>>> Gao Xiang
>>>>
>>>>>>
>>>>>> I'm not sure what the VFS guys think about this and if the kernel side
>>>>>> shall care about this.
>>>
>>> Fwiw, I'm not convinced and I think that's a big can of worms security
>>> wise and semantics wise. I have discussed whether a kernel-side fdstore
>>> would be something that systemd would use if available multiple times
>>> and they wouldn't use it because it provides them with no benefits over
>>> having it in userspace.
>>
>> As far as I know, currently there are approximately two ways to do
>> failover mechanisms in kernel.
>>
>> The first model much like a fuse-like model: in this mode, we should
>> keep and pass fd to maintain the active state.  And currently,
>> userspace should be responsible for the permission/security issues
>> when doing something like passing fds.
>>
>> The second model is like one device-one instance model, for example
>> ublk (If I understand correctly): each active instance (/dev/ublkbX)
>> has their own unique control device (/dev/ublkcX).  Users could
>> assign/change DAC/MAC for each control device.  And failover
>> recovery just needs to reopen the control device with proper
>> permission and do recovery.
>>
>> So just my own thought, kernel-side fdstore pseudo filesystem may
>> provide a DAC/MAC mechanism for the first model.  That is a much
>> cleaner way than doing some similar thing independently in each
>> subsystem which may need DAC/MAC-like mechanism.  But that is
>> just my own thought.
> 
> The failover mechanism for /dev/ublkcX could easily be implemented using
> the fdstore. The fact that they rolled their own thing is orthogonal to
> this imho. Implementing retrieval policies like this in the kernel is
> slowly advancing into /proc/$pid/fd/ levels of complexity. That's all
> better handled with appropriate policies in userspace. And cachefilesd
> can similarly just stash their fds in the fdstore.

Ok, got it.  I just wanted to know what a kernel fdstore would
currently look like (since Miklos mentioned it, I wondered whether
it's feasible, as it could also benefit non-fuse cases).
I think a userspace fdstore works for me (unless some other
interesting use cases come up for evaluation later).

Jingbo has an internal requirement for fuse, but that is purely
fuse stuff and out of my scope.

Thanks,
Gao Xiang