Message ID | 20240524064030.4944-1-jefflexu@linux.alibaba.com (mailing list archive) |
---|---|
Series | fuse: introduce fuse server recovery mechanism |
On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> 3. I don't know if a kernel based recovery mechanism is welcome on the
> community side. Any comment is welcome. Thanks!

I'd prefer something external to fuse.

Maybe a kernel based fdstore (lifetime connected to that of the
container) would be a useful service more generally?

Thanks,
Miklos

On 5/27/24 11:16 PM, Miklos Szeredi wrote:
> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>> community side. Any comment is welcome. Thanks!
>
> I'd prefer something external to fuse.

Okay, understood.

>
> Maybe a kernel based fdstore (lifetime connected to that of the
> container) would a useful service more generally?

Yeah I indeed had considered this, but I'm afraid VFS guys would be
concerned about why we do this on kernel side rather than in user space.

I'm not sure what the VFS guys think about this and if the kernel side
shall care about this.

Many thanks!

On 5/28/24 10:45 AM, Jingbo Xu wrote:
>
>
> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>> community side. Any comment is welcome. Thanks!
>>
>> I'd prefer something external to fuse.
>
> Okay, understood.
>
>>
>> Maybe a kernel based fdstore (lifetime connected to that of the
>> container) would a useful service more generally?
>
> Yeah I indeed had considered this, but I'm afraid VFS guys would be
> concerned about why we do this on kernel side rather than in user space.
>
> I'm not sure what the VFS guys think about this and if the kernel side
> shall care about this.
>

There was an RFC for kernel-side fdstore [1], though it's also
implemented upon FUSE.

[1]
https://lore.kernel.org/all/CA+a=Yy5rnqLqH2iR-ZY6AUkNJy48mroVV3Exmhmt-pfTi82kXA@mail.gmail.com/T/

On 2024/5/28 11:08, Jingbo Xu wrote:
>
>
> On 5/28/24 10:45 AM, Jingbo Xu wrote:
>>
>>
>> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>
>>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>>> community side. Any comment is welcome. Thanks!
>>>
>>> I'd prefer something external to fuse.
>>
>> Okay, understood.
>>
>>>
>>> Maybe a kernel based fdstore (lifetime connected to that of the
>>> container) would a useful service more generally?
>>
>> Yeah I indeed had considered this, but I'm afraid VFS guys would be
>> concerned about why we do this on kernel side rather than in user space.

Just from my own perspective, even if it's in FUSE, the concern is
almost the same.

I wonder if on-demand cachefiles can keep fds too in the future
(thus e.g. daemonless feature could even be implemented entirely
with kernel fdstore) but it still has the same concern or it's
a source of duplication.

Thanks,
Gao Xiang

>>
>> I'm not sure what the VFS guys think about this and if the kernel side
>> shall care about this.
>>
>
> There was an RFC for kernel-side fdstore [1], though it's also
> implemented upon FUSE.
>
> [1]
> https://lore.kernel.org/all/CA+a=Yy5rnqLqH2iR-ZY6AUkNJy48mroVV3Exmhmt-pfTi82kXA@mail.gmail.com/T/
>

On Tue, 28 May 2024 at 04:45, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> Yeah I indeed had considered this, but I'm afraid VFS guys would be
> concerned about why we do this on kernel side rather than in user space.
>
> I'm not sure what the VFS guys think about this and if the kernel side
> shall care about this.

Yes, that is indeed something that needs to be discussed.

I often find that when discussing something like this, a lot of good
ideas can come from different directions, so it can help move things
forward.

Try something really simple first, and post a patch. Don't overthink
the first version.

Thanks,
Miklos

On Tue, 28 May 2024 at 05:08, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> There was an RFC for kernel-side fdstore [1], though it's also
> implemented upon FUSE.

I strongly believe that this needs to be disassociated from fuse.

It could be a pseudo filesystem, though.

Thanks,
Miklos

On Fri, May 24, 2024 at 02:40:28PM +0800, Jingbo Xu wrote:
> Background
> ==========
> The fd of '/dev/fuse' serves as a message transmission channel between
> FUSE filesystem (kernel space) and fuse server (user space). Once the
> fd gets closed (intentionally or unintentionally), the FUSE filesystem
> gets aborted, and any attempt of filesystem access gets -ECONNABORTED
> error until the FUSE filesystem finally umounted.
>
> It is one of the requisites in production environment to provide
> uninterruptible filesystem service. The most straightforward way, and
> maybe the most widely used way, is that make another dedicated user
> daemon (similar to systemd fdstore) keep the device fd open. When the
> fuse daemon recovers from a crash, it can retrieve the device fd from the
> fdstore daemon through socket takeover (Unix domain socket) method [1]
> or pidfd_getfd() syscall [2]. In this way, as long as the fdstore
> daemon doesn't exit, the FUSE filesystem won't get aborted once the fuse
> daemon crashes, though the filesystem service may hang there for a while
> when the fuse daemon gets restarted and has not been completely
> recovered yet.
>
> This picture indeed works and has been deployed in our internal
> production environment until the following issues are encountered:
>
> 1. The fdstore daemon may be killed by mistake, in which case the FUSE
> filesystem gets aborted and irrecoverable.

That's only a problem if you use the fdstore of the per-user instance.
The main fdstore is part of PID 1 and you can't kill that. So really,
systemd needs to hand the fds from the per-user instance to the main
fdstore.

> 2. In scenarios of containerized deployment, the fuse daemon is deployed
> in a container POD, and a dedicated fdstore daemon needs to be deployed
> for each fuse daemon. The fdstore daemon could consume a amount of
> resources (e.g. memory footprint), which is not conducive to the dense
> container deployment.
>
> 3. Each fuse daemon implementation needs to implement its own fdstore
> daemon. If we implement the fuse recovery mechanism on the kernel side,
> all fuse daemon implementations could reuse this mechanism.

You can just use the global fdstore. That is a design limitation, not an
inherent limitation.

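For readers unfamiliar with the "socket takeover" method mentioned in the quoted background, the following is a minimal sketch of passing a descriptor over a Unix domain socket with SCM_RIGHTS. It is an illustration only, not code from the patch series: a pipe stands in for the /dev/fuse fd, both roles run in one process, and the helper names send_fd, recv_fd and demo are hypothetical. It assumes Python 3.9 or later on a Unix system, where socket.send_fds() and socket.recv_fds() are available.

```python
import os
import socket


def send_fd(sock: socket.socket, fd: int, label: str) -> None:
    """Pass one fd to the peer as SCM_RIGHTS ancillary data."""
    socket.send_fds(sock, [label.encode()], [fd])


def recv_fd(sock: socket.socket) -> tuple:
    """Receive one labelled fd; returns (label, fd)."""
    data, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    if len(fds) != 1:
        raise RuntimeError("expected exactly one fd")
    return data.decode(), fds[0]


def demo() -> bytes:
    # Stand-in for the /dev/fuse fd (assumption): the read end of a pipe.
    dev_fd, writer = os.pipe()
    # One end plays the fuse server, the other the fd-holding daemon.
    server, holder = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        send_fd(server, dev_fd, "fuse-dev")
        _label, kept = recv_fd(holder)   # holder now owns a duplicate
        os.close(dev_fd)                 # simulate the server losing its copy
        os.write(writer, b"still open")  # the open file description survives
        send_fd(holder, kept, "fuse-dev")
        _label, recovered = recv_fd(server)  # "restarted" server takes it back
        os.close(kept)
        try:
            return os.read(recovered, 64)
        finally:
            os.close(recovered)
    finally:
        os.close(writer)
        server.close()
        holder.close()
```

The point of the sketch is only that the underlying open file stays alive while any process holds a descriptor for it, which is what keeps the FUSE connection from being aborted while the server restarts.
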
On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
>
>
> On 2024/5/28 11:08, Jingbo Xu wrote:
> >
> >
> > On 5/28/24 10:45 AM, Jingbo Xu wrote:
> > >
> > >
> > > On 5/27/24 11:16 PM, Miklos Szeredi wrote:
> > > > On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > > >
> > > > > 3. I don't know if a kernel based recovery mechanism is welcome on the
> > > > > community side. Any comment is welcome. Thanks!
> > > >
> > > > I'd prefer something external to fuse.
> > >
> > > Okay, understood.
> > >
> > > >
> > > > Maybe a kernel based fdstore (lifetime connected to that of the
> > > > container) would a useful service more generally?
> > >
> > > Yeah I indeed had considered this, but I'm afraid VFS guys would be
> > > concerned about why we do this on kernel side rather than in user space.
>
> Just from my own perspective, even if it's in FUSE, the concern is
> almost the same.
>
> I wonder if on-demand cachefiles can keep fds too in the future
> (thus e.g. daemonless feature could even be implemented entirely
> with kernel fdstore) but it still has the same concern or it's
> a source of duplication.
>
> Thanks,
> Gao Xiang
>
> > >
> > > I'm not sure what the VFS guys think about this and if the kernel side
> > > shall care about this.

Fwiw, I'm not convinced and I think that's a big can of worms security
wise and semantics wise. I have discussed whether a kernel-side fdstore
would be something that systemd would use if available multiple times
and they wouldn't use it because it provides them with no benefits over
having it in userspace.

Especially since it implements a lot of special semantics and policy
that we really don't want in the kernel. I think that's just not
something we should do. We should give userspace all the means to
implement fdstores in userspace but not hold fds ourselves.

Hi Christian,

On 2024/5/28 16:43, Christian Brauner wrote:
> On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
>>
>>
>> On 2024/5/28 11:08, Jingbo Xu wrote:
>>>
>>>
>>> On 5/28/24 10:45 AM, Jingbo Xu wrote:
>>>>
>>>>
>>>> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>>>>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>
>>>>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>>>>> community side. Any comment is welcome. Thanks!
>>>>>
>>>>> I'd prefer something external to fuse.
>>>>
>>>> Okay, understood.
>>>>
>>>>>
>>>>> Maybe a kernel based fdstore (lifetime connected to that of the
>>>>> container) would a useful service more generally?
>>>>
>>>> Yeah I indeed had considered this, but I'm afraid VFS guys would be
>>>> concerned about why we do this on kernel side rather than in user space.
>>
>> Just from my own perspective, even if it's in FUSE, the concern is
>> almost the same.
>>
>> I wonder if on-demand cachefiles can keep fds too in the future
>> (thus e.g. daemonless feature could even be implemented entirely
>> with kernel fdstore) but it still has the same concern or it's
>> a source of duplication.
>>
>> Thanks,
>> Gao Xiang
>>
>>>>
>>>> I'm not sure what the VFS guys think about this and if the kernel side
>>>> shall care about this.
>
> Fwiw, I'm not convinced and I think that's a big can of worms security
> wise and semantics wise. I have discussed whether a kernel-side fdstore
> would be something that systemd would use if available multiple times
> and they wouldn't use it because it provides them with no benefits over
> having it in userspace.

As far as I know, currently there are approximately two ways to do
failover mechanisms in kernel.

The first model is much like a fuse-like model: in this model, we should
keep and pass fd to maintain the active state. And currently,
userspace should be responsible for the permission/security issues
when doing something like passing fds.

The second model is like one device-one instance model, for example
ublk (If I understand correctly): each active instance (/dev/ublkbX)
has its own unique control device (/dev/ublkcX). Users could
assign/change DAC/MAC for each control device. And failover
recovery just needs to reopen the control device with proper
permission and do recovery.

So just my own thought, kernel-side fdstore pseudo filesystem may
provide a DAC/MAC mechanism for the first model. That is a much
cleaner way than doing some similar thing independently in each
subsystem which may need DAC/MAC-like mechanism. But that is
just my own thought.

Thanks,
Gao Xiang

>
> Especially since it implements a lot of special semantics and policy
> that we really don't want in the kernel. I think that's just not
> something we should do. We should give userspace all the means to
> implement fdstores in userspace but not hold fds ourselves.

On Tue, May 28, 2024 at 05:13:04PM +0800, Gao Xiang wrote:
> Hi Christian,
>
> On 2024/5/28 16:43, Christian Brauner wrote:
> > On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
> > >
> > >
> > > On 2024/5/28 11:08, Jingbo Xu wrote:
> > > >
> > > >
> > > > On 5/28/24 10:45 AM, Jingbo Xu wrote:
> > > > >
> > > > >
> > > > > On 5/27/24 11:16 PM, Miklos Szeredi wrote:
> > > > > > On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > > > > >
> > > > > > > 3. I don't know if a kernel based recovery mechanism is welcome on the
> > > > > > > community side. Any comment is welcome. Thanks!
> > > > > >
> > > > > > I'd prefer something external to fuse.
> > > > >
> > > > > Okay, understood.
> > > > >
> > > > > >
> > > > > > Maybe a kernel based fdstore (lifetime connected to that of the
> > > > > > container) would a useful service more generally?
> > > > >
> > > > > Yeah I indeed had considered this, but I'm afraid VFS guys would be
> > > > > concerned about why we do this on kernel side rather than in user space.
> > >
> > > Just from my own perspective, even if it's in FUSE, the concern is
> > > almost the same.
> > >
> > > I wonder if on-demand cachefiles can keep fds too in the future
> > > (thus e.g. daemonless feature could even be implemented entirely
> > > with kernel fdstore) but it still has the same concern or it's
> > > a source of duplication.
> > >
> > > Thanks,
> > > Gao Xiang
> > >
> > > > >
> > > > > I'm not sure what the VFS guys think about this and if the kernel side
> > > > > shall care about this.
> >
> > Fwiw, I'm not convinced and I think that's a big can of worms security
> > wise and semantics wise. I have discussed whether a kernel-side fdstore
> > would be something that systemd would use if available multiple times
> > and they wouldn't use it because it provides them with no benefits over
> > having it in userspace.
>
> As far as I know, currently there are approximately two ways to do
> failover mechanisms in kernel.
>
> The first model much like a fuse-like model: in this mode, we should
> keep and pass fd to maintain the active state. And currently,
> userspace should be responsible for the permission/security issues
> when doing something like passing fds.
>
> The second model is like one device-one instance model, for example
> ublk (If I understand correctly): each active instance (/dev/ublkbX)
> has their own unique control device (/dev/ublkcX). Users could
> assign/change DAC/MAC for each control device. And failover
> recovery just needs to reopen the control device with proper
> permission and do recovery.
>
> So just my own thought, kernel-side fdstore pseudo filesystem may
> provide a DAC/MAC mechanism for the first model. That is a much
> cleaner way than doing some similar thing independently in each
> subsystem which may need DAC/MAC-like mechanism. But that is
> just my own thought.

The failover mechanism for /dev/ublkcX could easily be implemented using
the fdstore. The fact that they rolled their own thing is orthogonal to
this imho. Implementing retrieval policies like this in the kernel is
slowly advancing into /proc/$pid/fd/ levels of complexity. That's all
better handled with appropriate policies in userspace. And cachefilesd
can similarly just stash their fds in the fdstore.

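As a rough illustration of what "stashing fds in the fdstore" involves for a service running under systemd, the sketch below sends an FDSTORE=1 notification with the descriptor attached via SCM_RIGHTS to the socket named in $NOTIFY_SOCKET, and maps descriptors handed back after a restart using LISTEN_PID, LISTEN_FDS and LISTEN_FDNAMES. The function names notify_fdstore and listen_fds_by_name and the optional notify_path parameter are hypothetical helpers for this example. It assumes Python 3.9+ and a unit configured with a non-zero FileDescriptorStoreMax=; real services would more commonly call sd_pid_notify_with_fds() from libsystemd.

```python
import os
import socket
from typing import Dict, Mapping, Optional

SD_LISTEN_FDS_START = 3  # first fd number systemd passes to a service


def notify_fdstore(fd: int, name: str,
                   notify_path: Optional[str] = None) -> bool:
    """Ask the service manager to keep `fd` in the fd store under `name`.

    Returns False when no notification socket is configured.
    """
    path = notify_path or os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading '@' denotes an abstract-namespace socket address.
    addr = "\0" + path[1:] if path.startswith("@") else path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(addr)
        msg = "FDSTORE=1\nFDNAME={}".format(name).encode()
        socket.send_fds(sock, [msg], [fd])
    return True


def listen_fds_by_name(
        environ: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Map fd names to fd numbers handed back by the manager on restart."""
    if environ.get("LISTEN_PID") != str(os.getpid()):
        return {}
    count = int(environ.get("LISTEN_FDS", "0"))
    names = environ.get("LISTEN_FDNAMES", "").split(":")
    # If several fds share a name, only the last one is kept in this sketch.
    return {n: SD_LISTEN_FDS_START + i for i, n in zip(range(count), names)}
```

A restarted server would call listen_fds_by_name() early in startup and, if the expected name is present, resume serving on that descriptor instead of opening /dev/fuse and mounting again.
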
Hi, Christian,

Thanks for the review.

On 5/28/24 4:38 PM, Christian Brauner wrote:
> On Fri, May 24, 2024 at 02:40:28PM +0800, Jingbo Xu wrote:
>> Background
>> ==========
>> The fd of '/dev/fuse' serves as a message transmission channel between
>> FUSE filesystem (kernel space) and fuse server (user space). Once the
>> fd gets closed (intentionally or unintentionally), the FUSE filesystem
>> gets aborted, and any attempt of filesystem access gets -ECONNABORTED
>> error until the FUSE filesystem finally umounted.
>>
>> It is one of the requisites in production environment to provide
>> uninterruptible filesystem service. The most straightforward way, and
>> maybe the most widely used way, is that make another dedicated user
>> daemon (similar to systemd fdstore) keep the device fd open. When the
>> fuse daemon recovers from a crash, it can retrieve the device fd from the
>> fdstore daemon through socket takeover (Unix domain socket) method [1]
>> or pidfd_getfd() syscall [2]. In this way, as long as the fdstore
>> daemon doesn't exit, the FUSE filesystem won't get aborted once the fuse
>> daemon crashes, though the filesystem service may hang there for a while
>> when the fuse daemon gets restarted and has not been completely
>> recovered yet.
>>
>> This picture indeed works and has been deployed in our internal
>> production environment until the following issues are encountered:
>>
>> 1. The fdstore daemon may be killed by mistake, in which case the FUSE
>> filesystem gets aborted and irrecoverable.
>
> That's only a problem if you use the fdstore of the per-user instance.
> The main fdstore is part of PID 1 and you can't kill that. So really,
> systemd needs to hand the fds from the per-user instance to the main
> fdstore.

Systemd indeed has implemented its own fdstore mechanism in the user space.

Nowadays more and more fuse daemons are running inside containers, but a
container generally has no systemd inside it.

>
>> 2. In scenarios of containerized deployment, the fuse daemon is deployed
>> in a container POD, and a dedicated fdstore daemon needs to be deployed
>> for each fuse daemon. The fdstore daemon could consume a amount of
>> resources (e.g. memory footprint), which is not conducive to the dense
>> container deployment.
>>
>> 3. Each fuse daemon implementation needs to implement its own fdstore
>> daemon. If we implement the fuse recovery mechanism on the kernel side,
>> all fuse daemon implementations could reuse this mechanism.
>
> You can just the global fdstore. That is a design limitation not an
> inherent limitation.

What I initially meant is that each fuse daemon implementation (e.g. s3fs,
ossfs, and other vendors) needs to make its own but similar mechanism
for daemon failover. There has not been a common component for fdstore
in container scenarios just like systemd fdstore.

I'd admit that it's controversial to implement a kernel-side fdstore.
Thus I only implement a failover mechanism for fuse server in this RFC
patch. But I also understand Miklos's concern, as what we really need to
support daemon failover is just something like fdstore to keep the
device fd alive.

On 2024/5/28 17:32, Christian Brauner wrote:
> On Tue, May 28, 2024 at 05:13:04PM +0800, Gao Xiang wrote:
>> Hi Christian,
>>
>> On 2024/5/28 16:43, Christian Brauner wrote:
>>> On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2024/5/28 11:08, Jingbo Xu wrote:
>>>>>
>>>>>
>>>>> On 5/28/24 10:45 AM, Jingbo Xu wrote:
>>>>>>
>>>>>>
>>>>>> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>>>>>>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>>
>>>>>>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>>>>>>> community side. Any comment is welcome. Thanks!
>>>>>>>
>>>>>>> I'd prefer something external to fuse.
>>>>>>
>>>>>> Okay, understood.
>>>>>>
>>>>>>>
>>>>>>> Maybe a kernel based fdstore (lifetime connected to that of the
>>>>>>> container) would a useful service more generally?
>>>>>>
>>>>>> Yeah I indeed had considered this, but I'm afraid VFS guys would be
>>>>>> concerned about why we do this on kernel side rather than in user space.
>>>>
>>>> Just from my own perspective, even if it's in FUSE, the concern is
>>>> almost the same.
>>>>
>>>> I wonder if on-demand cachefiles can keep fds too in the future
>>>> (thus e.g. daemonless feature could even be implemented entirely
>>>> with kernel fdstore) but it still has the same concern or it's
>>>> a source of duplication.
>>>>
>>>> Thanks,
>>>> Gao Xiang
>>>>
>>>>>>
>>>>>> I'm not sure what the VFS guys think about this and if the kernel side
>>>>>> shall care about this.
>>>
>>> Fwiw, I'm not convinced and I think that's a big can of worms security
>>> wise and semantics wise. I have discussed whether a kernel-side fdstore
>>> would be something that systemd would use if available multiple times
>>> and they wouldn't use it because it provides them with no benefits over
>>> having it in userspace.
>>
>> As far as I know, currently there are approximately two ways to do
>> failover mechanisms in kernel.
>>
>> The first model much like a fuse-like model: in this mode, we should
>> keep and pass fd to maintain the active state. And currently,
>> userspace should be responsible for the permission/security issues
>> when doing something like passing fds.
>>
>> The second model is like one device-one instance model, for example
>> ublk (If I understand correctly): each active instance (/dev/ublkbX)
>> has their own unique control device (/dev/ublkcX). Users could
>> assign/change DAC/MAC for each control device. And failover
>> recovery just needs to reopen the control device with proper
>> permission and do recovery.
>>
>> So just my own thought, kernel-side fdstore pseudo filesystem may
>> provide a DAC/MAC mechanism for the first model. That is a much
>> cleaner way than doing some similar thing independently in each
>> subsystem which may need DAC/MAC-like mechanism. But that is
>> just my own thought.
>
> The failover mechanism for /dev/ublkcX could easily be implemented using
> the fdstore. The fact that they rolled their own thing is orthogonal to
> this imho. Implementing retrieval policies like this in the kernel is
> slowly advancing into /proc/$pid/fd/ levels of complexity. That's all
> better handled with appropriate policies in userspace. And cachefilesd
> can similarly just stash their fds in the fdstore.

Ok, got it. I just would like to know what a kernel fdstore currently
sounds like (Miklos mentioned it, so I wonder if it's feasible, since
it could benefit non-fuse cases). I think a userspace fdstore works for
me (unless some other interesting use cases come up for evaluation later).

Jingbo has an internal requirement for fuse; that is pure fuse stuff,
and it is out of my scope though.

Thanks,
Gao Xiang