| Message ID | 1566406146-7887-1-git-send-email-alex@zadara.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | [RFC-PATCH] nfsd: when unhashing openowners, increment openowner's refcount |
Hi Bruce, Chuck,

You are listed as maintainers of nfsd. Can you please take a look at the below patch?

Thanks,
Alex.

On Wed, Aug 21, 2019 at 7:50 PM Alex Lyakas <alex@zadara.com> wrote:
>
> release_openowner() expects an extra refcnt taken for the openowner,
> which it is releasing.
>
> With nfsd_inject_forget_client_openowners() and nfsd_inject_forget_openowners(),
> we unhash openowners and collect them into a reaplist. Later we call
> nfsd_reap_openowners(), which calls release_openowner(), which releases
> all of the openowner's stateids.
> Each OPEN stateid holds a refcnt on the openowner. Therefore, after releasing
> the last OPEN stateid via its sc_free function, which is nfs4_free_ol_stateid,
> nfs4_put_stateowner() will be called, which will see that it is the last
> refcnt for the openowner. As a result, the openowner will be freed.
> But later, release_openowner() will go ahead and call release_last_closed_stateid()
> and nfs4_put_stateowner() on the same openowner which was just released.
> This corrupts memory and causes random crashes.
>
> After we fixed this, we confirmed that the openowner is not freed
> prematurely. It is freed by release_openowner()'s final call
> to nfs4_put_stateowner().
>
> However, we still get (other) random crashes and memory corruptions
> with nfsd_inject_forget_client_openowners() and
> nfsd_inject_forget_openowners().
> According to our analysis, we don't see any other refcount issues.
> Can anybody from the community review these flows for other potential issues?
>
> Signed-off-by: Alex Lyakas <alex@zadara.com>
> ---
>  fs/nfsd/nfs4state.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 7857942..4e9afca 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -7251,6 +7251,7 @@ static u64 nfsd_foreach_client_lock(struct nfs4_client *clp, u64 max,
>  			func(oop);
>  			if (collect) {
>  				atomic_inc(&clp->cl_rpc_users);
> +				nfs4_get_stateowner(&oop->oo_owner);
>  				list_add(&oop->oo_perclient, collect);
>  			}
>  		}
> --
> 1.9.1
>
On Sun, Aug 25, 2019 at 01:12:34PM +0300, Alex Lyakas wrote:
> You are listed as maintainers of nfsd. Can you please take a look at
> the below patch?

Thanks! I take it this was found by some kind of code analysis or fuzzing, not use in production? Asking because I've been considering just deprecating it, so:

> > After we fixed this, we confirmed that the openowner is not freed
> > prematurely. It is freed by release_openowner()'s final call
> > to nfs4_put_stateowner().
> >
> > However, we still get (other) random crashes and memory corruptions
> > with nfsd_inject_forget_client_openowners() and
> > nfsd_inject_forget_openowners().
> > According to our analysis, we don't see any other refcount issues.
> > Can anybody from the community review these flows for other potential issues?

I'm wondering how much effort we want to put into tracking all that down.

--b.
On Mon, Aug 26, 2019 at 09:39:51AM -0400, J. Bruce Fields wrote:
> On Sun, Aug 25, 2019 at 01:12:34PM +0300, Alex Lyakas wrote:
> > You are listed as maintainers of nfsd. Can you please take a look at
> > the below patch?
>
> Thanks!
>
> I take it this was found by some kind of code analysis or fuzzing, not
> use in production?
>
> Asking because I've been considering just deprecating it, so:

So, unless someone objects I'd like to queue this up for 5.4.

--b.

commit 9d60d93198c6
Author: J. Bruce Fields <bfields@redhat.com>
Date:   Mon Aug 26 10:28:58 2019 -0400

    Deprecate nfsd fault injection

    This is only useful for client testing. I haven't really maintained
    it, and reference counting and locking are wrong at this point.

    You can get some of the same functionality now from nfsd/clients/.

    It was a good idea but I think its time has passed.

    In the unlikely event of users, hopefully the BROKEN dependency will
    prompt them to speak up. Otherwise I expect to remove it soon.

    Reported-by: Alex Lyakas <alex@zadara.com>
    Signed-off-by: J. Bruce Fields <bfields@redhat.com>

diff --git a/fs/nfsd/Kconfig b/fs/nfsd/Kconfig
index bff8456220e0..10cefb0c07c7 100644
--- a/fs/nfsd/Kconfig
+++ b/fs/nfsd/Kconfig
@@ -148,7 +148,7 @@ config NFSD_V4_SECURITY_LABEL

 config NFSD_FAULT_INJECTION
 	bool "NFS server manual fault injection"
-	depends on NFSD_V4 && DEBUG_KERNEL && DEBUG_FS
+	depends on NFSD_V4 && DEBUG_KERNEL && DEBUG_FS && BROKEN
 	help
 	  This option enables support for manually injecting faults
 	  into the NFS server. This is intended to be used for
Hi Bruce,

Thank you for your response.

On Mon, Aug 26, 2019 at 4:39 PM J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Sun, Aug 25, 2019 at 01:12:34PM +0300, Alex Lyakas wrote:
> > You are listed as maintainers of nfsd. Can you please take a look at
> > the below patch?
>
> Thanks!
>
> I take it this was found by some kind of code analysis or fuzzing, not
> use in production?

We are hitting the following issue in production quite frequently:

- We create two local file systems, FS1 and FS2, on the server machine S.
- We export both FS1 and FS2 through nfsd to the same nfs client, running on client machine C.
- On C, we mount both exported file systems and start writing files to both of them.
- After a few minutes, on server machine S, we un-export FS1 only. We don't unmount FS1 on the client machine C prior to un-exporting. Also, FS2 remains exported to C.
- We want to unmount FS1 on the server machine S, but we fail, because nfsd still has open files on FS1.

Debugging this issue showed the following root cause: we have a nfs4_client entry for the client C. This entry has two nfs4_openowners, for FS1 and FS2, although FS1 was un-exported. Looking at the stateids of both openowners, we see that they have stateids of kind NFS4_OPEN_STID, and each stateid is holding a nfs4_file. The reason we cannot unmount FS1 is that we still have an openowner for FS1, holding open-stateids, which hold open files on FS1.

The laundromat doesn't help in this case, because it can only decide per-nfs4_client whether it should be purged. But in this case, since FS2 is still exported to C, there is no reason to purge the nfs4_client. This situation remains until we un-export FS2 as well. Then the whole nfs4_client is purged, all the files get closed, and we can unmount both FS1 and FS2.

We started looking around, and we found the fault injection code that can "forget openowners". We wrote some custom code that allows us to select the openowner which is not needed anymore.
We then unhash this openowner and unhash and close all of its stateids. The files get closed, and we can unmount FS1.

Is the described issue familiar to you? It is very easily reproducible. What is the way to solve it? To our understanding, if we un-export a FS from nfsd, we should be able to unmount it.

For example, we could introduce a sysfs or procfs entry that lists all clients and openowners, and another entry allowing the user to "forget" a particular openowner. If you feel this is the way to move forward, we can try to provide patches for review.

Thanks,
Alex.

>
> Asking because I've been considering just deprecating it, so:
>
> > > After we fixed this, we confirmed that the openowner is not freed
> > > prematurely. It is freed by release_openowner()'s final call
> > > to nfs4_put_stateowner().
> > >
> > > However, we still get (other) random crashes and memory corruptions
> > > with nfsd_inject_forget_client_openowners() and
> > > nfsd_inject_forget_openowners().
> > > According to our analysis, we don't see any other refcount issues.
> > > Can anybody from the community review these flows for other potential issues?
>
> I'm wondering how much effort we want to put into tracking all that
> down.
>
> --b.
On Tue, Aug 27, 2019 at 12:05:28PM +0300, Alex Lyakas wrote:
> Is the described issue familiar to you?

Yep, got it, but I haven't seen anyone try to solve it using the fault injection code, that's interesting!

There's also fs/nfsd/unlock_filesystem. It only unlocks NLM (NFSv3) locks. But it'd probably be reasonable to teach it to get NFSv4 state too (locks, opens, delegations, and layouts).

But my feeling's always been that the cleanest way to do it is to create two containers with separate net namespaces and run nfsd in both of them. You can start and stop the servers in the different containers independently.

> It is very easily reproducible. What is the way to solve it? To our
> understanding, if we un-export a FS from nfsd, we should be able to
> unmount it.

Unexporting has never removed locks or opens or other state, for what it's worth.

--b.
Hi Bruce,

Thanks for the clarifications.

On Tue, Aug 27, 2019 at 11:51 PM J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Tue, Aug 27, 2019 at 12:05:28PM +0300, Alex Lyakas wrote:
> > Is the described issue familiar to you?
>
> Yep, got it, but I haven't seen anyone try to solve it using the fault
> injection code, that's interesting!
>
> There's also fs/nfsd/unlock_filesystem. It only unlocks NLM (NFSv3)
> locks. But it'd probably be reasonable to teach it to get NFSv4 state
> too (locks, opens, delegations, and layouts).
>
> But my feeling's always been that the cleanest way to do it is to create
> two containers with separate net namespaces and run nfsd in both of
> them. You can start and stop the servers in the different containers
> independently.

I am looking at the code, and currently nfsd creates a single namespace subsystem in init_nfsd. All nfs4_clients run in this subsystem.

So the proposal is to use register_pernet_subsys() for every filesystem that is exported? I presume that current nfsd code cannot do this, and some rework is required to move away from a single subsystem to a per-export subsystem. Also, grepping through kernel code, I see that namespace subsystems are created by different modules as part of module initialization, rather than dynamically. Furthermore, in our case the same nfsd machine S can export tens or even hundreds of local filesystems. Is it fine to have hundreds of subsystems?

Otherwise, I understand that the current behavior is a "won't fix", and it is expected for the client machine to unmount the export before un-exporting the file system at the nfsd machine. Is this correct?

Thanks,
Alex.

> > It is very easily reproducible. What is the way to solve it? To our
> > understanding, if we un-export a FS from nfsd, we should be able to
> > unmount it.
>
> Unexporting has never removed locks or opens or other state, for what
> it's worth.
>
> --b.
On Wed, Aug 28, 2019 at 06:20:22PM +0300, Alex Lyakas wrote:
> On Tue, Aug 27, 2019 at 11:51 PM J. Bruce Fields <bfields@fieldses.org> wrote:
> >
> > On Tue, Aug 27, 2019 at 12:05:28PM +0300, Alex Lyakas wrote:
> > > Is the described issue familiar to you?
> >
> > Yep, got it, but I haven't seen anyone try to solve it using the fault
> > injection code, that's interesting!
> >
> > There's also fs/nfsd/unlock_filesystem. It only unlocks NLM (NFSv3)
> > locks. But it'd probably be reasonable to teach it to get NFSv4 state
> > too (locks, opens, delegations, and layouts).
> >
> > But my feeling's always been that the cleanest way to do it is to create
> > two containers with separate net namespaces and run nfsd in both of
> > them. You can start and stop the servers in the different containers
> > independently.
>
> I am looking at the code, and currently nfsd creates a single
> namespace subsystem in init_nfsd. All nfs4_clients run in this
> subsystem.
>
> So the proposal is to use register_pernet_subsys() for every
> filesystem that is exported?

No, I'm not proposing any kernel changes. Just create separate net namespaces from userspace and start nfsd from within them. And you'll also need to arrange for the different nfsds to get different exports.

In practice, the best way to do this may be using some container management service, I'm not sure.

> I presume that current nfsd code cannot
> do this, and some rework is required to move away from a single
> subsystem to a per-export subsystem. Also, grepping through kernel code,
> I see that namespace subsystems are created by different modules as
> part of module initialization, rather than dynamically.
> Furthermore, in our case the same nfsd machine S can export tens or
> even hundreds of local filesystems. Is it fine to have hundreds of
> subsystems?

I haven't done it myself, but I suspect hundreds of containers should be OK. It may depend on available resources, of course.
> Otherwise, I understand that the current behavior is a "won't fix",
> and it is expected for the client machine to unmount the export before
> un-exporting the file system at the nfsd machine. Is this correct?

You're definitely not the only ones to request this, so I'd like to have a working solution.

My preference would be to try the namespace/container approach first, and if that turns out not to work well for some reason, to update fs/nfsd/unlock_filesystem to handle NFSv4 stuff.

The fault injection code isn't the right interface for this. Even if we did decide it was worth fixing up and maintaining--it's really only designed for testing clients. I'd expect distros not to build it in their default kernels.

--b.
Hi Bruce,

On Wed, Aug 28, 2019 at 7:54 PM J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Wed, Aug 28, 2019 at 06:20:22PM +0300, Alex Lyakas wrote:
> > On Tue, Aug 27, 2019 at 11:51 PM J. Bruce Fields <bfields@fieldses.org> wrote:
> > >
> > > On Tue, Aug 27, 2019 at 12:05:28PM +0300, Alex Lyakas wrote:
> > > > Is the described issue familiar to you?
> > >
> > > Yep, got it, but I haven't seen anyone try to solve it using the fault
> > > injection code, that's interesting!
> > >
> > > There's also fs/nfsd/unlock_filesystem. It only unlocks NLM (NFSv3)
> > > locks. But it'd probably be reasonable to teach it to get NFSv4 state
> > > too (locks, opens, delegations, and layouts).
> > >
> > > But my feeling's always been that the cleanest way to do it is to create
> > > two containers with separate net namespaces and run nfsd in both of
> > > them. You can start and stop the servers in the different containers
> > > independently.
> >
> > I am looking at the code, and currently nfsd creates a single
> > namespace subsystem in init_nfsd. All nfs4_clients run in this
> > subsystem.
> >
> > So the proposal is to use register_pernet_subsys() for every
> > filesystem that is exported?
>
> No, I'm not proposing any kernel changes. Just create separate net
> namespaces from userspace and start nfsd from within them. And you'll
> also need to arrange for the different nfsds to get different exports.
>
> In practice, the best way to do this may be using some container
> management service, I'm not sure.
>
> > I presume that current nfsd code cannot
> > do this, and some rework is required to move away from a single
> > subsystem to a per-export subsystem. Also, grepping through kernel code,
> > I see that namespace subsystems are created by different modules as
> > part of module initialization, rather than dynamically.
> > Furthermore, in our case the same nfsd machine S can export tens or
> > even hundreds of local filesystems. Is it fine to have hundreds of
> > subsystems?
>
> I haven't done it myself, but I suspect hundreds of containers should be
> OK. It may depend on available resources, of course.
>
> > Otherwise, I understand that the current behavior is a "won't fix",
> > and it is expected for the client machine to unmount the export before
> > un-exporting the file system at the nfsd machine. Is this correct?
>
> You're definitely not the only ones to request this, so I'd like to have
> a working solution.
>
> My preference would be to try the namespace/container approach first,
> and if that turns out not to work well for some reason, to update
> fs/nfsd/unlock_filesystem to handle NFSv4 stuff.
>
> The fault injection code isn't the right interface for this. Even if we
> did decide it was worth fixing up and maintaining--it's really only
> designed for testing clients. I'd expect distros not to build it in
> their default kernels.

Hi Bruce,

We evaluated the network namespaces approach. But, unfortunately, it doesn't fit easily into how our system is currently structured. We would have to create and configure interfaces for every namespace, and (presumably) have a separate IP address for every namespace. All this seems a bit of overkill just to have several local filesystems exported to the same client (which is when we hit the issue). I would assume that other users would argue as well that creating a separate network namespace for every local filesystem is not the way to go from the administration point of view.

Regarding the fault injection code, we did not actually enable and use it. We instead wrote some custom code that is closely modeled on the fault injection code. And that's how we found the openowner refcnt issue, from which this thread started.
Similarly to the fault injection code, our code iterates over all the clients, all the openowners, all the OPEN stateids and the opened files. Then, for each opened file, it decides whether this file belongs to the file system that has just been un-exported. To accomplish that, we added a custom field to nfs4_file:

struct nfs4_file {
	atomic_t		fi_ref;
	spinlock_t		fi_lock;
	...
#ifdef CONFIG_NFSD_ZADARA
	u64			fi_share_id;	/* which exported FS this file belongs to; non-zero */
#endif /*CONFIG_NFSD_ZADARA*/
};

We also added a simple data structure that tracks all the exported file systems and assigns a unique "share_id" to each. In find_or_add_file(), we consult this data structure and assign a fi_share_id to every nfs4_file that is newly created.

The code that "forgets" all the relevant openowners looks like:

void forget_openowners_of_share_id(u64 share_id)
{
	LIST_HEAD(reaplist);

	spin_lock(&nn->client_lock);
	list_for_each_entry(clp, &nn->client_lru, cl_lru) {
		spin_lock(&clp->cl_lock);
		list_for_each_entry_safe(oo, oo_next, &clp->cl_openowners, oo_perclient) {
			list_for_each_entry(stp, &oo->oo_owner.so_stateids, st_perstateowner) {
				if (stp->st_stid.sc_file->fi_share_id == share_id) {
					/* this file belongs to the relevant share_id */
					...
				}
			}
			if (/* openowner has only files from this share_id */) {
				/* we don't expect the same openowner to have files from different share_ids */
				nfs4_get_stateowner(&oo->oo_owner);	/* take extra refcnt */
				unhash_openowner_locked(oo);
				atomic_inc(&clp->cl_refcount);
				list_add(&oo->oo_perclient, &reaplist);
			}
		}
		spin_unlock(&clp->cl_lock);
	}
	spin_unlock(&nn->client_lock);

	nfsd_reap_openowners(&reaplist);
}

So this code is very similar to the existing fault injection code; it's just that the criterion for openowner selection is different. Currently this code is invoked from a custom procfs entry by a user-space application, before unmounting the local file system.

Would moving this code into the "unlock_filesystem" infrastructure be acceptable?
Since the "share_id" approach is very specific to our usage, what criteria would you suggest for selecting the openowners to be "forgotten"?

Thanks,
Alex.
On Thu, Aug 29, 2019 at 09:12:49PM +0300, Alex Lyakas wrote:
> We evaluated the network namespaces approach. But, unfortunately, it
> doesn't fit easily into how our system is currently structured. We
> would have to create and configure interfaces for every namespace, and
> have a separate IP address (presumably) for every namespace.

Yes.

> All this
> seems a bit of an overkill, to just have several local filesystems
> exported to the same client (which is when we hit the issue). I would
> assume that some other users would argue as well that creating a
> separate network namespace for every local filesystem is not the way
> to go from the administration point of view.

OK, makes sense.

And I take it you don't want to go around to each client and shut things down cleanly. And you're fine with the client applications erroring out when you yank their filesystem out from underneath them.

(I wonder what happens these days when that happens on a linux client when there are dirty pages. I think you may just end up with a useless mount that you can't get rid of till you power down the client.)

> Regarding the failure injection code, we did not actually enable and
> use it. We instead wrote some custom code that is highly modeled after
> the failure injection code.

Sounds interesting.... I'll try to understand it and give some comments later.

...

> Currently this code is invoked from a custom procfs entry, by
> user-space application, before unmounting the local file system.
>
> Would moving this code into the "unlock_filesystem" infrastructure be
> acceptable?

Yes. I'd be interested in patches.

> Since the "share_id" approach is very custom for our
> usage, what criteria would you suggest for selecting the openowners to
> be "forgotten"?

The share_id shouldn't be necessary. I'll think about it.

--b.
On Thu, Aug 29, 2019 at 09:12:49PM +0300, Alex Lyakas wrote:
> Would moving this code into the "unlock_filesystem" infrastructure be
> acceptable? Since the "share_id" approach is very custom for our
> usage, what criteria would you suggest for selecting the openowners to
> be "forgotten"?

Have you looked at what unlock_filesystem() does? It's just translating the given path to a superblock, then matching that against inodes in nlmsvc_match_sb().

It's a little more complicated for nfs4_files since they don't have a pointer to the inode. (Maybe they should.) You can see how I get around this in e.g. fs/nfsd/nfs4state.c:nfs4_show_lock().

A superblock isn't the same thing as an export, thanks to bind mounts and subdirectory exports. But if the goal is to be able to unmount, then a superblock is probably what you want.

--b.
Hi Bruce,

On Fri, Aug 30, 2019 at 10:08 PM J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Thu, Aug 29, 2019 at 09:12:49PM +0300, Alex Lyakas wrote:
> > We evaluated the network namespaces approach. But, unfortunately, it
> > doesn't fit easily into how our system is currently structured. We
> > would have to create and configure interfaces for every namespace, and
> > have a separate IP address (presumably) for every namespace.
>
> Yes.
>
> > All this
> > seems a bit of an overkill, to just have several local filesystems
> > exported to the same client (which is when we hit the issue). I would
> > assume that some other users would argue as well that creating a
> > separate network namespace for every local filesystem is not the way
> > to go from the administration point of view.
>
> OK, makes sense.
>
> And I take it you don't want to go around to each client and shut things
> down cleanly. And you're fine with the client applications erroring out
> when you yank their filesystem out from underneath them.

It's not that we don't want to unmount at each client. The problem is that we don't control the client machines, as they are owned by customers. We definitely recommend that customers unmount before un-exporting, but in some cases it still doesn't happen, most notably in automated environments. Since the un-export operation is initiated by the customer, I presume she understands that the nfs client might error out if not unmounted properly beforehand.

> (I wonder what happens these days when that happens on a linux client
> when there are dirty pages. I think you may just end up with a useless
> mount that you can't get rid of till you power down the client.)

Right, in some cases the IO process gets stuck forever.

> > Regarding the failure injection code, we did not actually enable and
> > use it. We instead wrote some custom code that is highly modeled after
> > the failure injection code.
>
> Sounds interesting.... I'll try to understand it and give some comments
> later.
>
> ...
>
> > Currently this code is invoked from a custom procfs entry, by
> > user-space application, before unmounting the local file system.
> >
> > Would moving this code into the "unlock_filesystem" infrastructure be
> > acceptable?
>
> Yes. I'd be interested in patches.
>
> > Since the "share_id" approach is very custom for our
> > usage, what criteria would you suggest for selecting the openowners to
> > be "forgotten"?
>
> The share_id shouldn't be necessary. I'll think about it.
>
> --b.
Hi Bruce,

On Fri, Aug 30, 2019 at 10:54 PM J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Thu, Aug 29, 2019 at 09:12:49PM +0300, Alex Lyakas wrote:
> > Would moving this code into the "unlock_filesystem" infrastructure be
> > acceptable? Since the "share_id" approach is very custom for our
> > usage, what criteria would you suggest for selecting the openowners to
> > be "forgotten"?
>
> Have you looked at what unlock_filesystem() does? It's just translating
> the given path to a superblock, then matching that against inodes in
> nlmsvc_match_sb().
>
> It's a little more complicated for nfs4_files since they don't have a
> pointer to the inode. (Maybe they should.) You can see how I get around
> this in e.g. fs/nfsd/nfs4state.c:nfs4_show_lock().
>
> A superblock isn't the same thing as an export, thanks to bind mounts
> and subdirectory exports. But if the goal is to be able to unmount,
> then a superblock is probably what you want.

Thanks for your suggestion. The superblock approach works very well for us. An initial patch is on its way.

Thanks,
Alex.

> --b.
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 7857942..4e9afca 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -7251,6 +7251,7 @@ static u64 nfsd_foreach_client_lock(struct nfs4_client *clp, u64 max,
 			func(oop);
 			if (collect) {
 				atomic_inc(&clp->cl_rpc_users);
+				nfs4_get_stateowner(&oop->oo_owner);
 				list_add(&oop->oo_perclient, collect);
 			}
 		}
release_openowner() expects an extra refcnt taken for the openowner, which it is releasing.

With nfsd_inject_forget_client_openowners() and nfsd_inject_forget_openowners(), we unhash openowners and collect them into a reaplist. Later we call nfsd_reap_openowners(), which calls release_openowner(), which releases all of the openowner's stateids. Each OPEN stateid holds a refcnt on the openowner. Therefore, after releasing the last OPEN stateid via its sc_free function, which is nfs4_free_ol_stateid, nfs4_put_stateowner() will be called, which will see that it is the last refcnt for the openowner. As a result, the openowner will be freed. But later, release_openowner() will go ahead and call release_last_closed_stateid() and nfs4_put_stateowner() on the same openowner which was just released. This corrupts memory and causes random crashes.

After we fixed this, we confirmed that the openowner is not freed prematurely. It is freed by release_openowner()'s final call to nfs4_put_stateowner().

However, we still get (other) random crashes and memory corruptions with nfsd_inject_forget_client_openowners() and nfsd_inject_forget_openowners(). According to our analysis, we don't see any other refcount issues. Can anybody from the community review these flows for other potential issues?

Signed-off-by: Alex Lyakas <alex@zadara.com>
---
 fs/nfsd/nfs4state.c | 1 +
 1 file changed, 1 insertion(+)