diff mbox

Quota-enabled XFS hangs during mount

Message ID 20170127170734.GA49571@bfoster.bfoster (mailing list archive)
State Deferred, archived
Headers show

Commit Message

Brian Foster Jan. 27, 2017, 5:07 p.m. UTC
On Fri, Jan 27, 2017 at 02:06:45PM +0100, Martin Svec wrote:
> Dne 26.1.2017 v 20:12 Brian Foster napsal(a):
> > On Thu, Jan 26, 2017 at 06:46:42PM +0100, Martin Svec wrote:
> >> Hello,
> >>
> >> Dne 25.1.2017 v 23:17 Brian Foster napsal(a):
> >>> On Tue, Jan 24, 2017 at 02:17:36PM +0100, Martin Svec wrote:
> >>>> Hello,
> >>>>
> >>>> Dne 23.1.2017 v 14:44 Brian Foster napsal(a):
> >>>>> On Mon, Jan 23, 2017 at 10:44:20AM +0100, Martin Svec wrote:
> >>>>>> Hello Dave,
> >>>>>>
> >>>>>> Any updates on this? It's a bit annoying to workaround the bug by increasing RAM just because of the
> >>>>>> initial quotacheck.
> >>>>>>
> >>>>> Note that Dave is away on a bit of an extended vacation[1]. It looks
> >>>>> like he was in the process of fishing through the code to spot any
> >>>>> potential problems related to quotacheck+reclaim. I see you've cc'd him
> >>>>> directly so we'll see if we get a response wrt to if he got anywhere
> >>>>> with that...
> >>>>>
> >>>>> Skimming back through this thread, it looks like we have an issue where
> >>>>> quota check is not quite reliable in the event of reclaim, and you
> >>>>> appear to be reproducing this due to a probably unique combination of
> >>>>> large inode count and low memory.
> >>>>>
> >>>>> Is my understanding correct that you've reproduced this on more recent
> >>>>> kernels than the original report? 
> >>>> Yes, I repeated the tests using 4.9.3 kernel on another VM where we hit this issue.
> >>>>
> >>>> Configuration:
> >>>> * vSphere 5.5 virtual machine, 2 vCPUs, virtual disks residing on iSCSI VMFS datastore
> >>>> * Debian Jessie 64 bit webserver, vanilla kernel 4.9.3
> >>>> * 180 GB XFS data disk mounted as /www
> >>>>
> >>>> Quotacheck behavior depends on assigned RAM:
> >>>> * 2 or less GiB: mount /www leads to a storm of OOM kills including shell, ttys etc., so the system
> >>>> becomes unusable.
> >>>> * 3 GiB: mount /www task hangs in the same way as I reported in earlier in this thread.
> >>>> * 4 or more GiB: mount /www succeeds.
> >>>>
> >>> I was able to reproduce the quotacheck OOM situation on latest kernels.
> >>> This problem actually looks like a regression as of commit 17c12bcd3
> >>> ("xfs: when replaying bmap operations, don't let unlinked inodes get
> >>> reaped"), but I don't think that patch is the core problem. That patch
> >>> pulled up setting MS_ACTIVE on the superblock from after XFS runs
> >>> quotacheck to before it (for other reasons), which has a side effect of
> >>> causing inodes to be placed onto the lru once they are released. Before
> >>> this change, all inodes were immediately marked for reclaim once
> >>> released from quotacheck because the superblock had not been set active.
> >>>
> >>> The problem here is first that quotacheck issues a bulkstat and thus
> >>> grabs and releases every inode in the fs. The quotacheck occurs at mount
> >>> time, which means we still hold the s_umount lock and thus the shrinker
> >>> cannot run even though it is registered. Therefore, we basically just
> >>> populate the lru until we've consumed too much memory and blow up.
> >>>
> >>> I think the solution here is to preserve the quotacheck behavior prior
> >>> to commit 17c12bcd3 via something like the following:
> >>>
> >>> --- a/fs/xfs/xfs_qm.c
> >>> +++ b/fs/xfs/xfs_qm.c
> >>> @@ -1177,7 +1177,7 @@ xfs_qm_dqusage_adjust(
> >>>  	 * the case in all other instances. It's OK that we do this because
> >>>  	 * quotacheck is done only at mount time.
> >>>  	 */
> >>> -	error = xfs_iget(mp, NULL, ino, 0, XFS_ILOCK_EXCL, &ip);
> >>> +	error = xfs_iget(mp, NULL, ino, XFS_IGET_DONTCACHE, XFS_ILOCK_EXCL, &ip);
> >>>  	if (error) {
> >>>  		*res = BULKSTAT_RV_NOTHING;
> >>>  		return error;
> >>>
> >>> ... which allows quotacheck to run as normal in my quick tests. Could
> >>> you try this on your more recent kernel tests and see whether you still
> >>> reproduce any problems?
> >> The above patch fixes OOM issues and reduces overall memory consumption during quotacheck. However,
> >> it does not fix the original xfs_qm_flush_one() freezing. I'm still able to reproduce it with 1 GB
> >> of RAM or lower. Tested with 4.9.5 kernel.
> >>
> > Ok, thanks. I'll get that fix posted shortly.
> >
> > I hadn't tried reducing RAM any further. I dropped my vm down to 1GB and
> > I don't reproduce a hang. If I drop to 512MB, the mount actually crashes
> > due to what looks like the problem that djwong just fixed[1].
> >
> > With that one liner applied, it does look like I've hit a mount hang in
> > the quotacheck path. Note that I'm also running into OOM issues again
> > though, probably due to legitimately not having enough RAM for this vm.
> > Anyways, I'll see if I can dig anything out of that...
> >
> > FWIW, this is all on the latest for-next (4.10.0-rc5).
> >
> > [1] http://www.spinics.net/lists/linux-xfs/msg03869.html
> >
> >> If it makes sense to you, I can rsync the whole filesystem to a new XFS volume and repeat the tests.
> >> At least, that could tell us if the problem depends on a particular state of on-disk metadata
> >> structures or it's a general property of the given filesystem tree.
> >>
> > That couldn't hurt, thanks.
> >
> 
> Well, after rsync to a fresh non-resized XFS volume, I still hit the mount hang with 1GB RAM.
> 

The problem looks like a race between dquot reclaim and quotacheck. The
high level sequence of events is as follows:

 - During quotacheck, xfs_qm_dqiterate() walks the physical dquot
   buffers and queues them to the delwri queue.
 - Next, kswapd kicks in and attempts to reclaim a dquot that is backed
   by a buffer on the quotacheck delwri queue. xfs_qm_dquot_isolate()
   acquires the flush lock and attempts to queue to the reclaim delwri
   queue. This silently fails because the buffer is already queued.

   From this point forward, the dquot flush lock is not going to be
   released until the buffer is submitted for I/O and completed via
   quotacheck.
 - Quotacheck continues on to the xfs_qm_flush_one() pass, hits the
   dquot in question and waits on the flush lock to issue the flush of
   the recalculated values. *deadlock*

There are at least a few ways to deal with this. We could do something
granular to fix up the reclaim path to check whether the buffer is
already queued or something of that nature before we actually invoke the
flush. I think this is effectively pointless, however, because the first
part of quotacheck walks and queues all physical dquot buffers anyways.

In other words, I think dquot reclaim during quotacheck should probably
be bypassed. Given that, we could either adjust when the shrinker is
registered until after quotacheck or set a flag somewhere to cause dquot
reclaim to back out when quotacheck is running. I opted for something
like the latter. Care to test the appended patch?

Note that I think this does mean that you could still have low memory
issues if you happen to have a lot of quotas defined..

Brian

---8<---

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Martin Svec Jan. 27, 2017, 8:49 p.m. UTC | #1
Dne 27.1.2017 v 18:07 Brian Foster napsal(a):
> On Fri, Jan 27, 2017 at 02:06:45PM +0100, Martin Svec wrote:
>> Dne 26.1.2017 v 20:12 Brian Foster napsal(a):
>>> On Thu, Jan 26, 2017 at 06:46:42PM +0100, Martin Svec wrote:
>>>> Hello,
>>>>
>>>> Dne 25.1.2017 v 23:17 Brian Foster napsal(a):
>>>>> On Tue, Jan 24, 2017 at 02:17:36PM +0100, Martin Svec wrote:
>>>>>> Hello,
>>>>>>
>>>>>> Dne 23.1.2017 v 14:44 Brian Foster napsal(a):
>>>>>>> On Mon, Jan 23, 2017 at 10:44:20AM +0100, Martin Svec wrote:
>>>>>>>> Hello Dave,
>>>>>>>>
>>>>>>>> Any updates on this? It's a bit annoying to workaround the bug by increasing RAM just because of the
>>>>>>>> initial quotacheck.
>>>>>>>>
>>>>>>> Note that Dave is away on a bit of an extended vacation[1]. It looks
>>>>>>> like he was in the process of fishing through the code to spot any
>>>>>>> potential problems related to quotacheck+reclaim. I see you've cc'd him
>>>>>>> directly so we'll see if we get a response wrt to if he got anywhere
>>>>>>> with that...
>>>>>>>
>>>>>>> Skimming back through this thread, it looks like we have an issue where
>>>>>>> quota check is not quite reliable in the event of reclaim, and you
>>>>>>> appear to be reproducing this due to a probably unique combination of
>>>>>>> large inode count and low memory.
>>>>>>>
>>>>>>> Is my understanding correct that you've reproduced this on more recent
>>>>>>> kernels than the original report? 
>>>>>> Yes, I repeated the tests using 4.9.3 kernel on another VM where we hit this issue.
>>>>>>
>>>>>> Configuration:
>>>>>> * vSphere 5.5 virtual machine, 2 vCPUs, virtual disks residing on iSCSI VMFS datastore
>>>>>> * Debian Jessie 64 bit webserver, vanilla kernel 4.9.3
>>>>>> * 180 GB XFS data disk mounted as /www
>>>>>>
>>>>>> Quotacheck behavior depends on assigned RAM:
>>>>>> * 2 or less GiB: mount /www leads to a storm of OOM kills including shell, ttys etc., so the system
>>>>>> becomes unusable.
>>>>>> * 3 GiB: mount /www task hangs in the same way as I reported in earlier in this thread.
>>>>>> * 4 or more GiB: mount /www succeeds.
>>>>>>
>>>>> I was able to reproduce the quotacheck OOM situation on latest kernels.
>>>>> This problem actually looks like a regression as of commit 17c12bcd3
>>>>> ("xfs: when replaying bmap operations, don't let unlinked inodes get
>>>>> reaped"), but I don't think that patch is the core problem. That patch
>>>>> pulled up setting MS_ACTIVE on the superblock from after XFS runs
>>>>> quotacheck to before it (for other reasons), which has a side effect of
>>>>> causing inodes to be placed onto the lru once they are released. Before
>>>>> this change, all inodes were immediately marked for reclaim once
>>>>> released from quotacheck because the superblock had not been set active.
>>>>>
>>>>> The problem here is first that quotacheck issues a bulkstat and thus
>>>>> grabs and releases every inode in the fs. The quotacheck occurs at mount
>>>>> time, which means we still hold the s_umount lock and thus the shrinker
>>>>> cannot run even though it is registered. Therefore, we basically just
>>>>> populate the lru until we've consumed too much memory and blow up.
>>>>>
>>>>> I think the solution here is to preserve the quotacheck behavior prior
>>>>> to commit 17c12bcd3 via something like the following:
>>>>>
>>>>> --- a/fs/xfs/xfs_qm.c
>>>>> +++ b/fs/xfs/xfs_qm.c
>>>>> @@ -1177,7 +1177,7 @@ xfs_qm_dqusage_adjust(
>>>>>  	 * the case in all other instances. It's OK that we do this because
>>>>>  	 * quotacheck is done only at mount time.
>>>>>  	 */
>>>>> -	error = xfs_iget(mp, NULL, ino, 0, XFS_ILOCK_EXCL, &ip);
>>>>> +	error = xfs_iget(mp, NULL, ino, XFS_IGET_DONTCACHE, XFS_ILOCK_EXCL, &ip);
>>>>>  	if (error) {
>>>>>  		*res = BULKSTAT_RV_NOTHING;
>>>>>  		return error;
>>>>>
>>>>> ... which allows quotacheck to run as normal in my quick tests. Could
>>>>> you try this on your more recent kernel tests and see whether you still
>>>>> reproduce any problems?
>>>> The above patch fixes OOM issues and reduces overall memory consumption during quotacheck. However,
>>>> it does not fix the original xfs_qm_flush_one() freezing. I'm still able to reproduce it with 1 GB
>>>> of RAM or lower. Tested with 4.9.5 kernel.
>>>>
>>> Ok, thanks. I'll get that fix posted shortly.
>>>
>>> I hadn't tried reducing RAM any further. I dropped my vm down to 1GB and
>>> I don't reproduce a hang. If I drop to 512MB, the mount actually crashes
>>> due to what looks like the problem that djwong just fixed[1].
>>>
>>> With that one liner applied, it does look like I've hit a mount hang in
>>> the quotacheck path. Note that I'm also running into OOM issues again
>>> though, probably due to legitimately not having enough RAM for this vm.
>>> Anyways, I'll see if I can dig anything out of that...
>>>
>>> FWIW, this is all on the latest for-next (4.10.0-rc5).
>>>
>>> [1] http://www.spinics.net/lists/linux-xfs/msg03869.html
>>>
>>>> If it makes sense to you, I can rsync the whole filesystem to a new XFS volume and repeat the tests.
>>>> At least, that could tell us if the problem depends on a particular state of on-disk metadata
>>>> structures or it's a general property of the given filesystem tree.
>>>>
>>> That couldn't hurt, thanks.
>>>
>> Well, after rsync to a fresh non-resized XFS volume, I still hit the mount hang with 1GB RAM.
>>
> The problem looks like a race between dquot reclaim and quotacheck. The
> high level sequence of events is as follows:
>
>  - During quotacheck, xfs_qm_dqiterate() walks the physical dquot
>    buffers and queues them to the delwri queue.
>  - Next, kswapd kicks in and attempts to reclaim a dquot that is backed
>    by a buffer on the quotacheck delwri queue. xfs_qm_dquot_isolate()
>    acquires the flush lock and attempts to queue to the reclaim delwri
>    queue. This silently fails because the buffer is already queued.
>
>    From this point forward, the dquot flush lock is not going to be
>    released until the buffer is submitted for I/O and completed via
>    quotacheck.
>  - Quotacheck continues on to the xfs_qm_flush_one() pass, hits the
>    dquot in question and waits on the flush lock to issue the flush of
>    the recalculated values. *deadlock*
>
> There are at least a few ways to deal with this. We could do something
> granular to fix up the reclaim path to check whether the buffer is
> already queued or something of that nature before we actually invoke the
> flush. I think this is effectively pointless, however, because the first
> part of quotacheck walks and queues all physical dquot buffers anyways.
>
> In other words, I think dquot reclaim during quotacheck should probably
> be bypassed. Given that, we could either adjust when the shrinker is
> registered until after quotacheck or set a flag somewhere to cause dquot
> reclaim to back out when quotacheck is running. I opted for something
> like the latter. Care to test the appended patch?
>
> Note that I think this does mean that you could still have low memory
> issues if you happen to have a lot of quotas defined..
>

Looks good, no more hangs with 1 GB. Thank you, Brian.

If I further reduce RAM to 512 MB, mount succeeds too but multiple "BUG: Bad page state in process
mount" errors are reported. Is it one of the expected low memory issues?

Martin

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin Svec Jan. 27, 2017, 9 p.m. UTC | #2
Dne 27.1.2017 v 21:49 Martin Svec napsal(a):
> Dne 27.1.2017 v 18:07 Brian Foster napsal(a):
>> On Fri, Jan 27, 2017 at 02:06:45PM +0100, Martin Svec wrote:
>>> Dne 26.1.2017 v 20:12 Brian Foster napsal(a):
>>>> On Thu, Jan 26, 2017 at 06:46:42PM +0100, Martin Svec wrote:
>>>>> Hello,
>>>>>
>>>>> Dne 25.1.2017 v 23:17 Brian Foster napsal(a):
>>>>>> On Tue, Jan 24, 2017 at 02:17:36PM +0100, Martin Svec wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> Dne 23.1.2017 v 14:44 Brian Foster napsal(a):
>>>>>>>> On Mon, Jan 23, 2017 at 10:44:20AM +0100, Martin Svec wrote:
>>>>>>>>> Hello Dave,
>>>>>>>>>
>>>>>>>>> Any updates on this? It's a bit annoying to workaround the bug by increasing RAM just because of the
>>>>>>>>> initial quotacheck.
>>>>>>>>>
>>>>>>>> Note that Dave is away on a bit of an extended vacation[1]. It looks
>>>>>>>> like he was in the process of fishing through the code to spot any
>>>>>>>> potential problems related to quotacheck+reclaim. I see you've cc'd him
>>>>>>>> directly so we'll see if we get a response wrt to if he got anywhere
>>>>>>>> with that...
>>>>>>>>
>>>>>>>> Skimming back through this thread, it looks like we have an issue where
>>>>>>>> quota check is not quite reliable in the event of reclaim, and you
>>>>>>>> appear to be reproducing this due to a probably unique combination of
>>>>>>>> large inode count and low memory.
>>>>>>>>
>>>>>>>> Is my understanding correct that you've reproduced this on more recent
>>>>>>>> kernels than the original report? 
>>>>>>> Yes, I repeated the tests using 4.9.3 kernel on another VM where we hit this issue.
>>>>>>>
>>>>>>> Configuration:
>>>>>>> * vSphere 5.5 virtual machine, 2 vCPUs, virtual disks residing on iSCSI VMFS datastore
>>>>>>> * Debian Jessie 64 bit webserver, vanilla kernel 4.9.3
>>>>>>> * 180 GB XFS data disk mounted as /www
>>>>>>>
>>>>>>> Quotacheck behavior depends on assigned RAM:
>>>>>>> * 2 or less GiB: mount /www leads to a storm of OOM kills including shell, ttys etc., so the system
>>>>>>> becomes unusable.
>>>>>>> * 3 GiB: mount /www task hangs in the same way as I reported in earlier in this thread.
>>>>>>> * 4 or more GiB: mount /www succeeds.
>>>>>>>
>>>>>> I was able to reproduce the quotacheck OOM situation on latest kernels.
>>>>>> This problem actually looks like a regression as of commit 17c12bcd3
>>>>>> ("xfs: when replaying bmap operations, don't let unlinked inodes get
>>>>>> reaped"), but I don't think that patch is the core problem. That patch
>>>>>> pulled up setting MS_ACTIVE on the superblock from after XFS runs
>>>>>> quotacheck to before it (for other reasons), which has a side effect of
>>>>>> causing inodes to be placed onto the lru once they are released. Before
>>>>>> this change, all inodes were immediately marked for reclaim once
>>>>>> released from quotacheck because the superblock had not been set active.
>>>>>>
>>>>>> The problem here is first that quotacheck issues a bulkstat and thus
>>>>>> grabs and releases every inode in the fs. The quotacheck occurs at mount
>>>>>> time, which means we still hold the s_umount lock and thus the shrinker
>>>>>> cannot run even though it is registered. Therefore, we basically just
>>>>>> populate the lru until we've consumed too much memory and blow up.
>>>>>>
>>>>>> I think the solution here is to preserve the quotacheck behavior prior
>>>>>> to commit 17c12bcd3 via something like the following:
>>>>>>
>>>>>> --- a/fs/xfs/xfs_qm.c
>>>>>> +++ b/fs/xfs/xfs_qm.c
>>>>>> @@ -1177,7 +1177,7 @@ xfs_qm_dqusage_adjust(
>>>>>>  	 * the case in all other instances. It's OK that we do this because
>>>>>>  	 * quotacheck is done only at mount time.
>>>>>>  	 */
>>>>>> -	error = xfs_iget(mp, NULL, ino, 0, XFS_ILOCK_EXCL, &ip);
>>>>>> +	error = xfs_iget(mp, NULL, ino, XFS_IGET_DONTCACHE, XFS_ILOCK_EXCL, &ip);
>>>>>>  	if (error) {
>>>>>>  		*res = BULKSTAT_RV_NOTHING;
>>>>>>  		return error;
>>>>>>
>>>>>> ... which allows quotacheck to run as normal in my quick tests. Could
>>>>>> you try this on your more recent kernel tests and see whether you still
>>>>>> reproduce any problems?
>>>>> The above patch fixes OOM issues and reduces overall memory consumption during quotacheck. However,
>>>>> it does not fix the original xfs_qm_flush_one() freezing. I'm still able to reproduce it with 1 GB
>>>>> of RAM or lower. Tested with 4.9.5 kernel.
>>>>>
>>>> Ok, thanks. I'll get that fix posted shortly.
>>>>
>>>> I hadn't tried reducing RAM any further. I dropped my vm down to 1GB and
>>>> I don't reproduce a hang. If I drop to 512MB, the mount actually crashes
>>>> due to what looks like the problem that djwong just fixed[1].
>>>>
>>>> With that one liner applied, it does look like I've hit a mount hang in
>>>> the quotacheck path. Note that I'm also running into OOM issues again
>>>> though, probably due to legitimately not having enough RAM for this vm.
>>>> Anyways, I'll see if I can dig anything out of that...
>>>>
>>>> FWIW, this is all on the latest for-next (4.10.0-rc5).
>>>>
>>>> [1] http://www.spinics.net/lists/linux-xfs/msg03869.html
>>>>
>>>>> If it makes sense to you, I can rsync the whole filesystem to a new XFS volume and repeat the tests.
>>>>> At least, that could tell us if the problem depends on a particular state of on-disk metadata
>>>>> structures or it's a general property of the given filesystem tree.
>>>>>
>>>> That couldn't hurt, thanks.
>>>>
>>> Well, after rsync to a fresh non-resized XFS volume, I still hit the mount hang with 1GB RAM.
>>>
>> The problem looks like a race between dquot reclaim and quotacheck. The
>> high level sequence of events is as follows:
>>
>>  - During quotacheck, xfs_qm_dqiterate() walks the physical dquot
>>    buffers and queues them to the delwri queue.
>>  - Next, kswapd kicks in and attempts to reclaim a dquot that is backed
>>    by a buffer on the quotacheck delwri queue. xfs_qm_dquot_isolate()
>>    acquires the flush lock and attempts to queue to the reclaim delwri
>>    queue. This silently fails because the buffer is already queued.
>>
>>    From this point forward, the dquot flush lock is not going to be
>>    released until the buffer is submitted for I/O and completed via
>>    quotacheck.
>>  - Quotacheck continues on to the xfs_qm_flush_one() pass, hits the
>>    dquot in question and waits on the flush lock to issue the flush of
>>    the recalculated values. *deadlock*
>>
>> There are at least a few ways to deal with this. We could do something
>> granular to fix up the reclaim path to check whether the buffer is
>> already queued or something of that nature before we actually invoke the
>> flush. I think this is effectively pointless, however, because the first
>> part of quotacheck walks and queues all physical dquot buffers anyways.
>>
>> In other words, I think dquot reclaim during quotacheck should probably
>> be bypassed. Given that, we could either adjust when the shrinker is
>> registered until after quotacheck or set a flag somewhere to cause dquot
>> reclaim to back out when quotacheck is running. I opted for something
>> like the latter. Care to test the appended patch?
>>
>> Note that I think this does mean that you could still have low memory
>> issues if you happen to have a lot of quotas defined..
>>
> Looks good, no more hangs with 1 GB. Thank you, Brian.
>
> If I further reduce RAM to 512 MB, mount succeeds too but multiple "BUG: Bad page state in process
> mount" errors are reported. Is it one of the expected low memory issues?
>
> Martin
>
Well, reading back through this thread, this might be related to patch [1] which I didn't apply to
4.9.5. I'll retry it next week.

[1] http://www.spinics.net/lists/linux-xfs/msg03869.html

Martin



--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong Jan. 27, 2017, 11:17 p.m. UTC | #3
On Fri, Jan 27, 2017 at 10:00:43PM +0100, Martin Svec wrote:
> Dne 27.1.2017 v 21:49 Martin Svec napsal(a):
> > Dne 27.1.2017 v 18:07 Brian Foster napsal(a):
> >> On Fri, Jan 27, 2017 at 02:06:45PM +0100, Martin Svec wrote:
> >>> Dne 26.1.2017 v 20:12 Brian Foster napsal(a):
> >>>> On Thu, Jan 26, 2017 at 06:46:42PM +0100, Martin Svec wrote:
> >>>>> Hello,
> >>>>>
> >>>>> Dne 25.1.2017 v 23:17 Brian Foster napsal(a):
> >>>>>> On Tue, Jan 24, 2017 at 02:17:36PM +0100, Martin Svec wrote:
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> Dne 23.1.2017 v 14:44 Brian Foster napsal(a):
> >>>>>>>> On Mon, Jan 23, 2017 at 10:44:20AM +0100, Martin Svec wrote:
> >>>>>>>>> Hello Dave,
> >>>>>>>>>
> >>>>>>>>> Any updates on this? It's a bit annoying to workaround the bug by increasing RAM just because of the
> >>>>>>>>> initial quotacheck.
> >>>>>>>>>
> >>>>>>>> Note that Dave is away on a bit of an extended vacation[1]. It looks
> >>>>>>>> like he was in the process of fishing through the code to spot any
> >>>>>>>> potential problems related to quotacheck+reclaim. I see you've cc'd him
> >>>>>>>> directly so we'll see if we get a response wrt to if he got anywhere
> >>>>>>>> with that...
> >>>>>>>>
> >>>>>>>> Skimming back through this thread, it looks like we have an issue where
> >>>>>>>> quota check is not quite reliable in the event of reclaim, and you
> >>>>>>>> appear to be reproducing this due to a probably unique combination of
> >>>>>>>> large inode count and low memory.
> >>>>>>>>
> >>>>>>>> Is my understanding correct that you've reproduced this on more recent
> >>>>>>>> kernels than the original report? 
> >>>>>>> Yes, I repeated the tests using 4.9.3 kernel on another VM where we hit this issue.
> >>>>>>>
> >>>>>>> Configuration:
> >>>>>>> * vSphere 5.5 virtual machine, 2 vCPUs, virtual disks residing on iSCSI VMFS datastore
> >>>>>>> * Debian Jessie 64 bit webserver, vanilla kernel 4.9.3
> >>>>>>> * 180 GB XFS data disk mounted as /www
> >>>>>>>
> >>>>>>> Quotacheck behavior depends on assigned RAM:
> >>>>>>> * 2 or less GiB: mount /www leads to a storm of OOM kills including shell, ttys etc., so the system
> >>>>>>> becomes unusable.
> >>>>>>> * 3 GiB: mount /www task hangs in the same way as I reported in earlier in this thread.
> >>>>>>> * 4 or more GiB: mount /www succeeds.
> >>>>>>>
> >>>>>> I was able to reproduce the quotacheck OOM situation on latest kernels.
> >>>>>> This problem actually looks like a regression as of commit 17c12bcd3
> >>>>>> ("xfs: when replaying bmap operations, don't let unlinked inodes get
> >>>>>> reaped"), but I don't think that patch is the core problem. That patch
> >>>>>> pulled up setting MS_ACTIVE on the superblock from after XFS runs
> >>>>>> quotacheck to before it (for other reasons), which has a side effect of
> >>>>>> causing inodes to be placed onto the lru once they are released. Before
> >>>>>> this change, all inodes were immediately marked for reclaim once
> >>>>>> released from quotacheck because the superblock had not been set active.
> >>>>>>
> >>>>>> The problem here is first that quotacheck issues a bulkstat and thus
> >>>>>> grabs and releases every inode in the fs. The quotacheck occurs at mount
> >>>>>> time, which means we still hold the s_umount lock and thus the shrinker
> >>>>>> cannot run even though it is registered. Therefore, we basically just
> >>>>>> populate the lru until we've consumed too much memory and blow up.
> >>>>>>
> >>>>>> I think the solution here is to preserve the quotacheck behavior prior
> >>>>>> to commit 17c12bcd3 via something like the following:
> >>>>>>
> >>>>>> --- a/fs/xfs/xfs_qm.c
> >>>>>> +++ b/fs/xfs/xfs_qm.c
> >>>>>> @@ -1177,7 +1177,7 @@ xfs_qm_dqusage_adjust(
> >>>>>>  	 * the case in all other instances. It's OK that we do this because
> >>>>>>  	 * quotacheck is done only at mount time.
> >>>>>>  	 */
> >>>>>> -	error = xfs_iget(mp, NULL, ino, 0, XFS_ILOCK_EXCL, &ip);
> >>>>>> +	error = xfs_iget(mp, NULL, ino, XFS_IGET_DONTCACHE, XFS_ILOCK_EXCL, &ip);
> >>>>>>  	if (error) {
> >>>>>>  		*res = BULKSTAT_RV_NOTHING;
> >>>>>>  		return error;
> >>>>>>
> >>>>>> ... which allows quotacheck to run as normal in my quick tests. Could
> >>>>>> you try this on your more recent kernel tests and see whether you still
> >>>>>> reproduce any problems?
> >>>>> The above patch fixes OOM issues and reduces overall memory consumption during quotacheck. However,
> >>>>> it does not fix the original xfs_qm_flush_one() freezing. I'm still able to reproduce it with 1 GB
> >>>>> of RAM or lower. Tested with 4.9.5 kernel.
> >>>>>
> >>>> Ok, thanks. I'll get that fix posted shortly.
> >>>>
> >>>> I hadn't tried reducing RAM any further. I dropped my vm down to 1GB and
> >>>> I don't reproduce a hang. If I drop to 512MB, the mount actually crashes
> >>>> due to what looks like the problem that djwong just fixed[1].
> >>>>
> >>>> With that one liner applied, it does look like I've hit a mount hang in
> >>>> the quotacheck path. Note that I'm also running into OOM issues again
> >>>> though, probably due to legitimately not having enough RAM for this vm.
> >>>> Anyways, I'll see if I can dig anything out of that...
> >>>>
> >>>> FWIW, this is all on the latest for-next (4.10.0-rc5).
> >>>>
> >>>> [1] http://www.spinics.net/lists/linux-xfs/msg03869.html
> >>>>
> >>>>> If it makes sense to you, I can rsync the whole filesystem to a new XFS volume and repeat the tests.
> >>>>> At least, that could tell us if the problem depends on a particular state of on-disk metadata
> >>>>> structures or it's a general property of the given filesystem tree.
> >>>>>
> >>>> That couldn't hurt, thanks.
> >>>>
> >>> Well, after rsync to a fresh non-resized XFS volume, I still hit the mount hang with 1GB RAM.
> >>>
> >> The problem looks like a race between dquot reclaim and quotacheck. The
> >> high level sequence of events is as follows:
> >>
> >>  - During quotacheck, xfs_qm_dqiterate() walks the physical dquot
> >>    buffers and queues them to the delwri queue.
> >>  - Next, kswapd kicks in and attempts to reclaim a dquot that is backed
> >>    by a buffer on the quotacheck delwri queue. xfs_qm_dquot_isolate()
> >>    acquires the flush lock and attempts to queue to the reclaim delwri
> >>    queue. This silently fails because the buffer is already queued.
> >>
> >>    From this point forward, the dquot flush lock is not going to be
> >>    released until the buffer is submitted for I/O and completed via
> >>    quotacheck.
> >>  - Quotacheck continues on to the xfs_qm_flush_one() pass, hits the
> >>    dquot in question and waits on the flush lock to issue the flush of
> >>    the recalculated values. *deadlock*
> >>
> >> There are at least a few ways to deal with this. We could do something
> >> granular to fix up the reclaim path to check whether the buffer is
> >> already queued or something of that nature before we actually invoke the
> >> flush. I think this is effectively pointless, however, because the first
> >> part of quotacheck walks and queues all physical dquot buffers anyways.
> >>
> >> In other words, I think dquot reclaim during quotacheck should probably
> >> be bypassed. Given that, we could either adjust when the shrinker is
> >> registered until after quotacheck or set a flag somewhere to cause dquot
> >> reclaim to back out when quotacheck is running. I opted for something
> >> like the latter. Care to test the appended patch?
> >>
> >> Note that I think this does mean that you could still have low memory
> >> issues if you happen to have a lot of quotas defined..
> >>
> > Looks good, no more hangs with 1 GB. Thank you, Brian.
> >
> > If I further reduce RAM to 512 MB, mount succeeds too but multiple "BUG: Bad page state in process
> > mount" errors are reported. Is it one of the expected low memory issues?
> >
> > Martin
> >
> Well, reading back through this thread, this might be related to patch [1] which I didn't apply to
> 4.9.5. I'll retry it next week.
> 
> [1] http://www.spinics.net/lists/linux-xfs/msg03869.html

Yes, it is. :)

--D

> 
> Martin
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Jan. 28, 2017, 10:42 p.m. UTC | #4
On Fri, Jan 27, 2017 at 12:07:34PM -0500, Brian Foster wrote:
> The problem looks like a race between dquot reclaim and quotacheck. The
> high level sequence of events is as follows:
> 
>  - During quotacheck, xfs_qm_dqiterate() walks the physical dquot
>    buffers and queues them to the delwri queue.
>  - Next, kswapd kicks in and attempts to reclaim a dquot that is backed
>    by a buffer on the quotacheck delwri queue. xfs_qm_dquot_isolate()
>    acquires the flush lock and attempts to queue to the reclaim delwri
>    queue. This silently fails because the buffer is already queued.
> 
>    From this point forward, the dquot flush lock is not going to be
>    released until the buffer is submitted for I/O and completed via
>    quotacheck.
>  - Quotacheck continues on to the xfs_qm_flush_one() pass, hits the
>    dquot in question and waits on the flush lock to issue the flush of
>    the recalculated values. *deadlock*
> 
> There are at least a few ways to deal with this. We could do something
> granular to fix up the reclaim path to check whether the buffer is
> already queued or something of that nature before we actually invoke the
> flush. I think this is effectively pointless, however, because the first
> part of quotacheck walks and queues all physical dquot buffers anyways.
> 
> In other words, I think dquot reclaim during quotacheck should probably
> be bypassed.
....

> Note that I think this does mean that you could still have low memory
> issues if you happen to have a lot of quotas defined..

Hmmm..... Really needs fixing.

I think submitting the buffer list after xfs_qm_dqiterate() and
waiting for completion will avoid this problem.

However, I suspect reclaim can still race with flushing, so we need
to detect "stuck" dquots, submit the delwri buffer queue and wait,
then flush the dquot again.

Cheers,

Dave.
Brian Foster Jan. 30, 2017, 3:31 p.m. UTC | #5
On Sun, Jan 29, 2017 at 09:42:42AM +1100, Dave Chinner wrote:
> On Fri, Jan 27, 2017 at 12:07:34PM -0500, Brian Foster wrote:
> > The problem looks like a race between dquot reclaim and quotacheck. The
> > high level sequence of events is as follows:
> > 
> >  - During quotacheck, xfs_qm_dqiterate() walks the physical dquot
> >    buffers and queues them to the delwri queue.
> >  - Next, kswapd kicks in and attempts to reclaim a dquot that is backed
> >    by a buffer on the quotacheck delwri queue. xfs_qm_dquot_isolate()
> >    acquires the flush lock and attempts to queue to the reclaim delwri
> >    queue. This silently fails because the buffer is already queued.
> > 
> >    From this point forward, the dquot flush lock is not going to be
> >    released until the buffer is submitted for I/O and completed via
> >    quotacheck.
> >  - Quotacheck continues on to the xfs_qm_flush_one() pass, hits the
> >    dquot in question and waits on the flush lock to issue the flush of
> >    the recalculated values. *deadlock*
> > 
> > There are at least a few ways to deal with this. We could do something
> > granular to fix up the reclaim path to check whether the buffer is
> > already queued or something of that nature before we actually invoke the
> > flush. I think this is effectively pointless, however, because the first
> > part of quotacheck walks and queues all physical dquot buffers anyways.
> > 
> > In other words, I think dquot reclaim during quotacheck should probably
> > be bypassed.
> ....
> 
> > Note that I think this does mean that you could still have low memory
> > issues if you happen to have a lot of quotas defined..
> 
> Hmmm..... Really needs fixing.
> 

Perhaps...

> I think submitting the buffer list after xfs_qm_dqiterate() and
> waiting for completion will avoid this problem.
> 

Yeah. I don't _think_ there are correctness issues since this is at
mount time and thus we don't have the risk of something else dirtying
and flushing a dquot before quotacheck has updated the dquots. The
tradeoff here just seems to be performance. We're now going to write all
of the dquot buffers twice in the common cause to handle the uncommon
low memory case.

It's also worth noting that 1.) I haven't actually investigated how many
dquots would need to exist vs. how little RAM before this becomes a
problem in practice and 2.) we still build up a delwri queue of all the
buffers and despite that there will always be fewer buffers than dquots,
there's nothing to suggest that won't ever pose a problem before the
dquots do if memory is limited enough.

The flip side is that this a pretty minor change code wise..

> However, I suspect reclaim can still race with flushing, so we need
> to detect "stuck" dquots, submit the delwri buffer queue and wait,
> then flush the dquot again.
> 

How so? The first phase of quotacheck resets the dquot buffers and
populates buffer_list. If we submit the queue at that point, we've now
dropped all references to buffers from quotacheck. The next phase of
quotacheck, xfs_qm_dqusage_adjust(), will acquire and dirty the
associated dquots as we bulkstat each inode in the fs.

The prospective difference here is that the dquots can now be flushed
asynchronously by reclaim as this occurs. So quotacheck dirties a dquot,
reclaim kicks in and flushes it, but is now able to actually complete
the flush by adding it to the reclaim delwri queue and submitting the
I/O. If the dqusage_adjust() walk now encounters the dquot again, it may
very well have to allocate and dirty the dquot again (and/or re-read the
dquot buf?) if it has been reclaimed since. I think the risk here is
that we end up thrashing and send I/Os for every per-inode change to a
dquot.

Finally, we get to the xfs_qm_flush_one() walk to flush every dquot that
has been dirtied. If reclaim races at this point, reclaim and
flush_one() should contend on the dquot lock and one or the other be
allowed to flush and queue the buffer. If reclaim wins the race, we may
end up waiting on reclaim I/O in quotacheck, however, due to the
xfs_dqflock() call (perhaps that could be converted to a nowait() lock
and skipped on failure).

I don't think there is a deadlock there, unless I am missing something.
But given that all of this reclaim behavior was enabled purely by
accident and the report in this case is caused by improper serialization
as opposed to a low memory issue that actually depends on (and thus
stresses) a functional reclaim, I'm not convinced this is the right fix
for this problem. Thoughts?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index b669b12..4e472ca 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -527,6 +527,8 @@  xfs_qm_shrink_scan(
 
 	if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != (__GFP_FS|__GFP_DIRECT_RECLAIM))
 		return 0;
+	if (qi->qi_quotacheck)
+		return 0;
 
 	INIT_LIST_HEAD(&isol.buffers);
 	INIT_LIST_HEAD(&isol.dispose);
@@ -623,6 +625,7 @@  xfs_qm_init_quotainfo(
 	INIT_RADIX_TREE(&qinf->qi_gquota_tree, GFP_NOFS);
 	INIT_RADIX_TREE(&qinf->qi_pquota_tree, GFP_NOFS);
 	mutex_init(&qinf->qi_tree_lock);
+	qinf->qi_quotacheck = false;
 
 	/* mutex used to serialize quotaoffs */
 	mutex_init(&qinf->qi_quotaofflock);
@@ -1295,6 +1298,7 @@  xfs_qm_quotacheck(
 	ASSERT(XFS_IS_QUOTA_RUNNING(mp));
 
 	xfs_notice(mp, "Quotacheck needed: Please wait.");
+	mp->m_quotainfo->qi_quotacheck = true;
 
 	/*
 	 * First we go thru all the dquots on disk, USR and GRP/PRJ, and reset
@@ -1384,6 +1388,8 @@  xfs_qm_quotacheck(
 	mp->m_qflags |= flags;
 
  error_return:
+	mp->m_quotainfo->qi_quotacheck = false;
+
 	while (!list_empty(&buffer_list)) {
 		struct xfs_buf *bp =
 			list_first_entry(&buffer_list, struct xfs_buf, b_list);
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index 2975a82..d5a443d 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -89,6 +89,7 @@  typedef struct xfs_quotainfo {
 	struct xfs_def_quota	qi_grp_default;
 	struct xfs_def_quota	qi_prj_default;
 	struct shrinker  qi_shrinker;
+	bool			qi_quotacheck;	/* quotacheck is running */
 } xfs_quotainfo_t;
 
 static inline struct radix_tree_root *