Message ID | 20190401082958.26470-1-nborisov@suse.com (mailing list archive) |
---|---|
Headers | show |
Series | Preamture OOM fixes | expand |
On Mon, Apr 01, 2019 at 11:29:56AM +0300, Nikolay Borisov wrote: > David reported that following the async cow path refactoring that got into > misc-next he started seeing premature OOM kills due to large page allocations > (order 4). This was due to now the whole async context (alongisde ancillary > structures, describing chunks) being allocated in a single kmalloc call. The > first patch of this series aims to fix that by converting the kmalloc call to > kvmalloc. This will have virtually no perfromance impact for small writes and > for larger ones will fallback to vmalloc. > > However, while testing the first patch I observed BUG_ONs in the async > csum calculation path to trigger. This is clearly caused by the fact that > the first patch introduces a fairly large memory allocation. Nevertheless, > having large allocations in the csum path without a fallback to some sort of > order 0 page allocation or at least graceful handling of memory errors is gross. > Patch 2 aims to rectify this by switching the allocation in btrfs_csum_one_bio > to using kvmalloc, giving greated chance for success. Even if patch 1 is ignored > nothing prevents the csum calculation to trigger the bug on depending on the memory > usage/fragmentation of the server so IMO it's beneficial either ways. For the record: the patches that introduced the preallocation + followup cleanups have been removed from misc-next until we're sure that the supposed change in IO submission path does not introduce further surprises. The potential performance impact is about to be evaluated. The error not-handling currently depends on the kmalloc behaviour that does not fail for small allocations, so current code is safer for users. The preallocation provides a nicer way to handle errors early and fails gracefully, unfortunately this is out of the safe zone of kmalloc so this introduces a new problem. The io submission path is probably the biggest and worst BUG_ON error handling misuse, but also the hardest one to fix. We'll get there, but as the prealloc patches showed we need to be careful.