Message ID | 20250127080254.1302026-6-senozhatsky@chromium.org (mailing list archive) |
---|---|
State | New |
Series | zsmalloc: make zsmalloc preemptible |
On Mon, Jan 27, 2025 at 04:59:30PM +0900, Sergey Senozhatsky wrote:
> Introduce new API to map/unmap zsmalloc handle/object. The key
> difference is that this API does not impose atomicity restrictions
> on its users, unlike zs_map_object() which returns with page-faults
> and preemption disabled

I think that's not entirely accurate, see below.

[..]
> @@ -1309,12 +1297,14 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>  		goto out;
>  	}
>
> -	/* this object spans two pages */
> -	zpdescs[0] = zpdesc;
> -	zpdescs[1] = get_next_zpdesc(zpdesc);
> -	BUG_ON(!zpdescs[1]);
> +	ret = area->vm_buf;
> +	/* disable page faults to match kmap_local_page() return conditions */
> +	pagefault_disable();

Is this accurate/necessary? I am looking at kmap_local_page() and I
don't see it. Maybe that's remnant from the old code using
kmap_atomic()?

> +	if (mm != ZS_MM_WO) {
> +		/* this object spans two pages */
> +		zs_obj_copyin(area->vm_buf, zpdesc, off, class->size);
> +	}
>
> -	ret = __zs_map_object(area, zpdescs, off, class->size);
>  out:
>  	if (likely(!ZsHugePage(zspage)))
>  		ret += ZS_HANDLE_SIZE;
On Mon, Jan 27, 2025 at 04:59:30PM +0900, Sergey Senozhatsky wrote:
> Introduce new API to map/unmap zsmalloc handle/object. The key
> difference is that this API does not impose atomicity restrictions
> on its users, unlike zs_map_object() which returns with page-faults
> and preemption disabled - handle mapping API does not need a per-CPU
> vm-area because the users are required to provide an aux buffer for
> objects that span several physical pages.

I like the idea of supplying the buffer directly to zsmalloc, and zswap
already has per-CPU buffers allocated. This will help remove the special
case to handle not being able to sleep in zswap_decompress().

That being said, I am not a big fan of the new API for several reasons:
- The interface seems complicated, why do we need struct
  zs_handle_mapping? Can't the user just pass an extra parameter to
  zs_map_object/zs_unmap_object() to supply the buffer, and the return
  value is the pointer to the data within the buffer?
- This seems to require an additional buffer on the compress side. Right
  now, zswap compresses the page into its own buffer, maps the handle,
  and copies to it. Now the map operation will require an extra buffer.
  I guess in the WO case the buffer is not needed and we can just pass
  NULL?

Taking a step back, it actually seems to me that the mapping interface
may not be the best, at least from a zswap perspective. In both cases,
we map, copy from/to the handle, then unmap. The special casing here is
essentially handling the copy direction. Zram looks fairly similar but I
didn't look too closely.

I wonder if the API should store/load instead. You either pass a buffer
to be stored (equivalent to today's alloc + map + copy), or pass a
buffer to load into (equivalent to today's map + copy). What we really
need on the zswap side is zs_store() and zs_load(), not zs_map() with
different mapping types and an optional buffer if we are going to
eventually store.
I guess that's part of a larger overhaul and we'd need to update other zpool allocators (or remove them, z3fold should be coming soon). Anyway this is mostly just me ranting because improving the interface to avoid the atomicity requires making it even more complicated, when it's really simple when you think about it in terms of what you really want to do (i.e. store and load). > Keep zs_map_object/zs_unmap_object for the time being, as there are > still users of it, but eventually old API will be removed. > > Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> > --- > include/linux/zsmalloc.h | 29 ++++++++ > mm/zsmalloc.c | 148 ++++++++++++++++++++++++++++----------- > 2 files changed, 138 insertions(+), 39 deletions(-) > > diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h > index a48cd0ffe57d..72d84537dd38 100644 > --- a/include/linux/zsmalloc.h > +++ b/include/linux/zsmalloc.h > @@ -58,4 +58,33 @@ unsigned long zs_compact(struct zs_pool *pool); > unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size); > > void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats); > + > +struct zs_handle_mapping { > + unsigned long handle; > + /* Points to start of the object data either within local_copy or > + * within local_mapping. This is what callers should use to access > + * or modify handle data. > + */ > + void *handle_mem; > + > + enum zs_mapmode mode; > + union { > + /* > + * Handle object data copied, because it spans across several > + * (non-contiguous) physical pages. This pointer should be > + * set by the zs_map_handle() caller beforehand and should > + * never be accessed directly. > + */ > + void *local_copy; > + /* > + * Handle object mapped directly. Should never be used > + * directly. 
> + */ > + void *local_mapping; > + }; > +}; > + > +int zs_map_handle(struct zs_pool *pool, struct zs_handle_mapping *map); > +void zs_unmap_handle(struct zs_pool *pool, struct zs_handle_mapping *map); > + > #endif > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c > index a5c1f9852072..281bba4a3277 100644 > --- a/mm/zsmalloc.c > +++ b/mm/zsmalloc.c > @@ -1132,18 +1132,14 @@ static inline void __zs_cpu_down(struct mapping_area *area) > area->vm_buf = NULL; > } > > -static void *__zs_map_object(struct mapping_area *area, > - struct zpdesc *zpdescs[2], int off, int size) > +static void zs_obj_copyin(void *buf, struct zpdesc *zpdesc, int off, int size) > { > + struct zpdesc *zpdescs[2]; > size_t sizes[2]; > - char *buf = area->vm_buf; > - > - /* disable page faults to match kmap_local_page() return conditions */ > - pagefault_disable(); > > - /* no read fastpath */ > - if (area->vm_mm == ZS_MM_WO) > - goto out; > + zpdescs[0] = zpdesc; > + zpdescs[1] = get_next_zpdesc(zpdesc); > + BUG_ON(!zpdescs[1]); > > sizes[0] = PAGE_SIZE - off; > sizes[1] = size - sizes[0]; > @@ -1151,21 +1147,17 @@ static void *__zs_map_object(struct mapping_area *area, > /* copy object to per-cpu buffer */ > memcpy_from_page(buf, zpdesc_page(zpdescs[0]), off, sizes[0]); > memcpy_from_page(buf + sizes[0], zpdesc_page(zpdescs[1]), 0, sizes[1]); > -out: > - return area->vm_buf; > } > > -static void __zs_unmap_object(struct mapping_area *area, > - struct zpdesc *zpdescs[2], int off, int size) > +static void zs_obj_copyout(void *buf, struct zpdesc *zpdesc, int off, int size) > { > + struct zpdesc *zpdescs[2]; > size_t sizes[2]; > - char *buf; > > - /* no write fastpath */ > - if (area->vm_mm == ZS_MM_RO) > - goto out; > + zpdescs[0] = zpdesc; > + zpdescs[1] = get_next_zpdesc(zpdesc); > + BUG_ON(!zpdescs[1]); > > - buf = area->vm_buf; > buf = buf + ZS_HANDLE_SIZE; > size -= ZS_HANDLE_SIZE; > off += ZS_HANDLE_SIZE; > @@ -1176,10 +1168,6 @@ static void __zs_unmap_object(struct mapping_area *area, > /* copy 
per-cpu buffer to object */ > memcpy_to_page(zpdesc_page(zpdescs[0]), off, buf, sizes[0]); > memcpy_to_page(zpdesc_page(zpdescs[1]), 0, buf + sizes[0], sizes[1]); > - > -out: > - /* enable page faults to match kunmap_local() return conditions */ > - pagefault_enable(); > } > > static int zs_cpu_prepare(unsigned int cpu) > @@ -1260,6 +1248,8 @@ EXPORT_SYMBOL_GPL(zs_get_total_pages); > * against nested mappings. > * > * This function returns with preemption and page faults disabled. > + * > + * NOTE: this function is deprecated and will be removed. > */ > void *zs_map_object(struct zs_pool *pool, unsigned long handle, > enum zs_mapmode mm) > @@ -1268,10 +1258,8 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, > struct zpdesc *zpdesc; > unsigned long obj, off; > unsigned int obj_idx; > - > struct size_class *class; > struct mapping_area *area; > - struct zpdesc *zpdescs[2]; > void *ret; > > /* > @@ -1309,12 +1297,14 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, > goto out; > } > > - /* this object spans two pages */ > - zpdescs[0] = zpdesc; > - zpdescs[1] = get_next_zpdesc(zpdesc); > - BUG_ON(!zpdescs[1]); > + ret = area->vm_buf; > + /* disable page faults to match kmap_local_page() return conditions */ > + pagefault_disable(); > + if (mm != ZS_MM_WO) { > + /* this object spans two pages */ > + zs_obj_copyin(area->vm_buf, zpdesc, off, class->size); > + } > > - ret = __zs_map_object(area, zpdescs, off, class->size); > out: > if (likely(!ZsHugePage(zspage))) > ret += ZS_HANDLE_SIZE; > @@ -1323,13 +1313,13 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, > } > EXPORT_SYMBOL_GPL(zs_map_object); > > +/* NOTE: this function is deprecated and will be removed. 
*/ > void zs_unmap_object(struct zs_pool *pool, unsigned long handle) > { > struct zspage *zspage; > struct zpdesc *zpdesc; > unsigned long obj, off; > unsigned int obj_idx; > - > struct size_class *class; > struct mapping_area *area; > > @@ -1340,23 +1330,103 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle) > off = offset_in_page(class->size * obj_idx); > > area = this_cpu_ptr(&zs_map_area); > - if (off + class->size <= PAGE_SIZE) > + if (off + class->size <= PAGE_SIZE) { > kunmap_local(area->vm_addr); > - else { > - struct zpdesc *zpdescs[2]; > + goto out; > + } > > - zpdescs[0] = zpdesc; > - zpdescs[1] = get_next_zpdesc(zpdesc); > - BUG_ON(!zpdescs[1]); > + if (area->vm_mm != ZS_MM_RO) > + zs_obj_copyout(area->vm_buf, zpdesc, off, class->size); > + /* enable page faults to match kunmap_local() return conditions */ > + pagefault_enable(); > > - __zs_unmap_object(area, zpdescs, off, class->size); > - } > +out: > local_unlock(&zs_map_area.lock); > - > zspage_read_unlock(zspage); > } > EXPORT_SYMBOL_GPL(zs_unmap_object); > > +void zs_unmap_handle(struct zs_pool *pool, struct zs_handle_mapping *map) > +{ > + struct zspage *zspage; > + struct zpdesc *zpdesc; > + unsigned long obj, off; > + unsigned int obj_idx; > + struct size_class *class; > + > + obj = handle_to_obj(map->handle); > + obj_to_location(obj, &zpdesc, &obj_idx); > + zspage = get_zspage(zpdesc); > + class = zspage_class(pool, zspage); > + off = offset_in_page(class->size * obj_idx); > + > + if (off + class->size <= PAGE_SIZE) { > + kunmap_local(map->local_mapping); > + goto out; > + } > + > + if (map->mode != ZS_MM_RO) > + zs_obj_copyout(map->local_copy, zpdesc, off, class->size); > + > +out: > + zspage_read_unlock(zspage); > +} > +EXPORT_SYMBOL_GPL(zs_unmap_handle); > + > +int zs_map_handle(struct zs_pool *pool, struct zs_handle_mapping *map) > +{ > + struct zspage *zspage; > + struct zpdesc *zpdesc; > + unsigned long obj, off; > + unsigned int obj_idx; > + struct size_class *class; > 
+ > + WARN_ON(in_interrupt()); > + > + /* It guarantees it can get zspage from handle safely */ > + pool_read_lock(pool); > + obj = handle_to_obj(map->handle); > + obj_to_location(obj, &zpdesc, &obj_idx); > + zspage = get_zspage(zpdesc); > + > + /* > + * migration cannot move any zpages in this zspage. Here, class->lock > + * is too heavy since callers would take some time until they calls > + * zs_unmap_object API so delegate the locking from class to zspage > + * which is smaller granularity. > + */ > + zspage_read_lock(zspage); > + pool_read_unlock(pool); > + > + class = zspage_class(pool, zspage); > + off = offset_in_page(class->size * obj_idx); > + > + if (off + class->size <= PAGE_SIZE) { > + /* this object is contained entirely within a page */ > + map->local_mapping = kmap_local_zpdesc(zpdesc); > + map->handle_mem = map->local_mapping + off; > + goto out; > + } > + > + if (WARN_ON_ONCE(!map->local_copy)) { > + zspage_read_unlock(zspage); > + return -EINVAL; > + } > + > + map->handle_mem = map->local_copy; > + if (map->mode != ZS_MM_WO) { > + /* this object spans two pages */ > + zs_obj_copyin(map->local_copy, zpdesc, off, class->size); > + } > + > +out: > + if (likely(!ZsHugePage(zspage))) > + map->handle_mem += ZS_HANDLE_SIZE; > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(zs_map_handle); > + > /** > * zs_huge_class_size() - Returns the size (in bytes) of the first huge > * zsmalloc &size_class. > -- > 2.48.1.262.g85cc9f2d1e-goog >
On (25/01/27 21:26), Yosry Ahmed wrote:
> On Mon, Jan 27, 2025 at 04:59:30PM +0900, Sergey Senozhatsky wrote:
> > Introduce new API to map/unmap zsmalloc handle/object. The key
> > difference is that this API does not impose atomicity restrictions
> > on its users, unlike zs_map_object() which returns with page-faults
> > and preemption disabled
>
> I think that's not entirely accurate, see below.

Preemption is disabled via the zspage's rwlock_t: zs_map_object() returns
with it locked, and it is unlocked in zs_unmap_object(). The function
then disables page faults, and the per-CPU local lock (which protects the
per-CPU vm-area) additionally disables preemption.

> [..]
> > @@ -1309,12 +1297,14 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> >  		goto out;
> >  	}
> >
> > -	/* this object spans two pages */
> > -	zpdescs[0] = zpdesc;
> > -	zpdescs[1] = get_next_zpdesc(zpdesc);
> > -	BUG_ON(!zpdescs[1]);
> > +	ret = area->vm_buf;
> > +	/* disable page faults to match kmap_local_page() return conditions */
> > +	pagefault_disable();
>
> Is this accurate/necessary? I am looking at kmap_local_page() and I
> don't see it. Maybe that's remnant from the old code using
> kmap_atomic()?

No, this does not look accurate nor necessary to me. I assume that's from
a very long time ago, but regardless of that I don't really understand
why that API wants to resemble kmap_atomic() (I think that was the
intention). This interface is expected to be gone so I didn't want
to dig into it and fix it.
On Tue, Jan 28, 2025 at 09:37:20AM +0900, Sergey Senozhatsky wrote:
> On (25/01/27 21:26), Yosry Ahmed wrote:
> > On Mon, Jan 27, 2025 at 04:59:30PM +0900, Sergey Senozhatsky wrote:
> > > Introduce new API to map/unmap zsmalloc handle/object. The key
> > > difference is that this API does not impose atomicity restrictions
> > > on its users, unlike zs_map_object() which returns with page-faults
> > > and preemption disabled
> >
> > I think that's not entirely accurate, see below.
>
> Preemption is disabled via zspage-s rwlock_t - zs_map_object() returns
> with it being locked and it's being unlocked in zs_unmap_object(). Then
> the function disables pagefaults and per-CPU local lock (protects per-CPU
> vm-area) additionally disables preemption.

Right, I meant it does not always disable page faults.

> > [..]
> > > @@ -1309,12 +1297,14 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> > >  		goto out;
> > >  	}
> > >
> > > -	/* this object spans two pages */
> > > -	zpdescs[0] = zpdesc;
> > > -	zpdescs[1] = get_next_zpdesc(zpdesc);
> > > -	BUG_ON(!zpdescs[1]);
> > > +	ret = area->vm_buf;
> > > +	/* disable page faults to match kmap_local_page() return conditions */
> > > +	pagefault_disable();
> >
> > Is this accurate/necessary? I am looking at kmap_local_page() and I
> > don't see it. Maybe that's remnant from the old code using
> > kmap_atomic()?
>
> No, this does not look accurate nor necessary to me. I assume that's from
> a very long time ago, but regardless of that I don't really understand
> why that API wants to resemble kmap_atomic() (I think that was the
> intention). This interface is expected to be gone so I didn't want
> to dig into it and fix it.

My assumption has been that back when we were using kmap_atomic(), which
disables page faults, we wanted to make this API's behavior consistent
for users whether or not we called kmap_atomic() -- so this makes sure it
always disables page faults. Now that we switched to kmap_local_page(),
which doesn't disable page faults, this was left behind, ultimately
making the interface inconsistent and contradicting the purpose of its
existence.

This is 100% speculation on my end :)

Anyway, if this function will be removed soon then it's not worth
revisiting it now.
On (25/01/27 21:58), Yosry Ahmed wrote:
> On Mon, Jan 27, 2025 at 04:59:30PM +0900, Sergey Senozhatsky wrote:
> > Introduce new API to map/unmap zsmalloc handle/object. The key
> > difference is that this API does not impose atomicity restrictions
> > on its users, unlike zs_map_object() which returns with page-faults
> > and preemption disabled - handle mapping API does not need a per-CPU
> > vm-area because the users are required to provide an aux buffer for
> > objects that span several physical pages.
>
> I like the idea of supplying the buffer directly to zsmalloc, and zswap
> already has per-CPU buffers allocated. This will help remove the special
> case to handle not being able to sleep in zswap_decompress().

The interface, basically, is what we currently have, but the state
is moved out of zsmalloc internal per-CPU vm-area.

> That being said, I am not a big fan of the new API for several reasons:
> - The interface seems complicated, why do we need struct
>   zs_handle_mapping? Can't the user just pass an extra parameter to
>   zs_map_object/zs_unmap_object() to supply the buffer, and the return
>   value is the pointer to the data within the buffer?

At least now we need to save some state - e.g. direction of the map()
so that during unmap zsmalloc determines if it needs to perform copy-out
or not. It also needs that state in order to know if the buffer needs
to be unmapped.

zsmalloc MAP has two cases:
a) the object spans several physical non-contig pages: copy-in object into
   aux buffer and return (linear) pointer to that buffer
b) the object is contained within a physical page: kmap that page and
   return (linear) pointer to that mapping, unmap in zs_unmap_object().

> - This seems to require an additional buffer on the compress side. Right
>   now, zswap compresses the page into its own buffer, maps the handle,
>   and copies to it. Now the map operation will require an extra buffer.

Yes, for (a) mentioned above.

>   I guess in the WO case the buffer is not needed and we can just pass
>   NULL?

Yes.

> Taking a step back, it actually seems to me that the mapping interface
> may not be the best, at least from a zswap perspective. In both cases,
> we map, copy from/to the handle, then unmap. The special casing here is
> essentially handling the copy direction. Zram looks fairly similar but I
> didn't look too closely.
>
> I wonder if the API should store/load instead. You either pass a buffer
> to be stored (equivalent to today's alloc + map + copy), or pass a
> buffer to load into (equivalent to today's map + copy). What we really
> need on the zswap side is zs_store() and zs_load(), not zs_map() with
> different mapping types and an optional buffer if we are going to
> eventually store. I guess that's part of a larger overhaul and we'd need
> to update other zpool allocators (or remove them, z3fold should be
> coming soon).

So I thought about it: load and store.

zs_obj_load()
{
	zspage->page kmap, etc.
	memcpy buf page		# if direction is not WO
	unmap
}

zs_obj_store()
{
	zspage->page kmap, etc.
	memcpy page buf		# if direction is not RO
	unmap
}

load+store would not require zsmalloc to be preemptible internally, we
could just keep existing atomic locks and it would make things a little
simpler on the zram side (slot-free-notification is called from atomic
section).

But, and it's a big but. And it's (b) from the above. I wasn't brave
enough to just drop (b) optimization and replace it with memcpy(),
especially when we work with relatively large objects (say size-class
3600 bytes and above). This certainly would not make battery powered
devices happier. Maybe in zswap the page is only read once (is that
correct?), but in zram page can be read multiple times (e.g. when zram
is used as a raw block-dev, or has a mounted fs on it) which means
multiple extra memcpy()-s.
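[Editorial sketch] To make the cost of the load/store idea concrete, here is a minimal userspace model in C. It is not zsmalloc code: `struct obj_loc`, `obj_load()`, `obj_store()` and `PAGE_SZ` are hypothetical names, and plain arrays stand in for kmapped pages. The point it illustrates is that both directions always pay for a memcpy(), even when the object sits entirely inside one page, which is the (b) case the reply above is reluctant to give up.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096u	/* assumed page size for this model */

/* Hypothetical description of where an object lives. */
struct obj_loc {
	unsigned char *page[2];	/* page[1] is only touched if the object spills over */
	size_t off;		/* offset of the object within page[0] */
	size_t size;		/* object size in bytes */
};

/* Number of bytes of the object that live in the first page. */
static size_t first_chunk(const struct obj_loc *o)
{
	if (o->off + o->size <= PAGE_SZ)
		return o->size;
	return PAGE_SZ - o->off;
}

/* "load": always copy the object out into the caller's buffer. */
static void obj_load(const struct obj_loc *o, void *buf)
{
	size_t first = first_chunk(o);

	memcpy(buf, o->page[0] + o->off, first);
	if (first < o->size)
		memcpy((unsigned char *)buf + first, o->page[1], o->size - first);
}

/* "store": always copy the caller's buffer into the object. */
static void obj_store(const struct obj_loc *o, const void *buf)
{
	size_t first = first_chunk(o);

	memcpy(o->page[0] + o->off, buf, first);
	if (first < o->size)
		memcpy(o->page[1], (const unsigned char *)buf + first, o->size - first);
}
```

A caller would fill in an `obj_loc`, call `obj_store()` with the compressed data and later `obj_load()` into a scratch buffer before decompressing. No mapping is held between the two calls, which is why this shape would not need a preemptible zsmalloc.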
On (25/01/28 00:49), Yosry Ahmed wrote:
> > Preemption is disabled via zspage-s rwlock_t - zs_map_object() returns
> > with it being locked and it's being unlocked in zs_unmap_object(). Then
> > the function disables pagefaults and per-CPU local lock (protects per-CPU
> > vm-area) additionally disables preemption.
>
> Right, I meant it does not always disable page faults.

I'll add "sometimes" :)

[..]
> Anyway, if this function will be removed soon then it's not worth
> revisiting it now.

Ack.
On Tue, Jan 28, 2025 at 09:59:55AM +0900, Sergey Senozhatsky wrote: > On (25/01/27 21:58), Yosry Ahmed wrote: > > On Mon, Jan 27, 2025 at 04:59:30PM +0900, Sergey Senozhatsky wrote: > > > Introduce new API to map/unmap zsmalloc handle/object. The key > > > difference is that this API does not impose atomicity restrictions > > > on its users, unlike zs_map_object() which returns with page-faults > > > and preemption disabled - handle mapping API does not need a per-CPU > > > vm-area because the users are required to provide an aux buffer for > > > objects that span several physical pages. > > > > I like the idea of supplying the buffer directly to zsmalloc, and zswap > > already has per-CPU buffers allocated. This will help remove the special > > case to handle not being able to sleep in zswap_decompress(). > > The interface, basically, is what we currently have, but the state > is moved out of zsmalloc internal per-CPU vm-area. > > > That being said, I am not a big fan of the new API for several reasons: > > - The interface seems complicated, why do we need struct > > zs_handle_mapping? Can't the user just pass an extra parameter to > > zs_map_object/zs_unmap_object() to supply the buffer, and the return > > value is the pointer to the data within the buffer? > > At least now we need to save some state - e.g. direction of the map() > so that during unmap zsmalloc determines if it needs to perform copy-out > or not. It also needs that state in order to know if the buffer needs > to be unmapped. > > zsmalloc MAP has two cases: > a) the object spans several physical non-contig pages: copy-in object into > aux buffer and return (linear) pointer to that buffer > b) the object is contained within a physical page: kmap that page and > return (linear) pointer to that mapping, unmap in zs_unmap_object(). Ack. See below. > > - This seems to require an additional buffer on the compress side. 
Right > > now, zswap compresses the page into its own buffer, maps the handle, > > and copies to it. Now the map operation will require an extra buffer. > > Yes, for (a) mentioned above. > > > I guess in the WO case the buffer is not needed and we can just pass > > NULL? > > Yes. Perhaps we want to document this and enforce it (make sure that the NULL-ness of the buffer matches the access type). > > Taking a step back, it actually seems to me that the mapping interface > > may not be the best, at least from a zswap perspective. In both cases, > > we map, copy from/to the handle, then unmap. The special casing here is > > essentially handling the copy direction. Zram looks fairly similar but I > > didn't look too closely. > > > > I wonder if the API should store/load instead. You either pass a buffer > > to be stored (equivalent to today's alloc + map + copy), or pass a > > buffer to load into (equivalent to today's map + copy). What we really > > need on the zswap side is zs_store() and zs_load(), not zs_map() with > > different mapping types and an optional buffer if we are going to > > eventually store. I guess that's part of a larger overhaul and we'd need > > to update other zpool allocators (or remove them, z3fold should be > > coming soon). > > So I though about it: load and store. > > zs_obj_load() > { > zspage->page kmap, etc. > memcpy buf page # if direction is not WO > unmap > } > > zs_obj_store() > { > zspage->page kmap, etc. > memcpy page buf # if direction is not RO > unmap > } > > load+store would not require zsmalloc to be preemptible internally, we > could just keep existing atomic locks and it would make things a little > simpler on the zram side (slot-free-notification is called from atomic > section). > > But, and it's a big but. And it's (b) from the above. I wasn't brave > enough to just drop (b) optimization and replace it with memcpy(), > especially when we work with relatively large objects (say size-class > 3600 bytes and above). 
This certainly would not make battery powered
> devices happier. Maybe in zswap the page is only read once (is that
> correct?), but in zram page can be read multiple times (e.g. when zram
> is used as a raw block-dev, or has a mounted fs on it) which means
> multiple extra memcpy()-s.

In zswap, because we use the crypto_acomp API, when we cannot sleep with
the object mapped (which is true for zsmalloc), we just copy the
compressed object into a preallocated buffer anyway. So having a
zs_obj_load() interface would move that copy inside zsmalloc.

With your series, zswap can drop the memcpy and save some cycles on the
compress side. I didn't realize that zram does not perform any copies on
the read/decompress side.

Maybe the load interface can still provide a buffer to avoid the copy
where possible? I suspect with that we don't need the state and can
just pass a pointer. We'd need another call to potentially unmap, so
maybe load_start/load_end, or read_start/read_end.

Something like:

zs_obj_read_start(.., buf)
{
	if (contained in one page)
		return kmapped obj
	else
		memcpy to buf
		return buf
}

zs_obj_read_end(.., buf)
{
	if (contained in one page)
		kunmap
}

The interface is more straightforward and we can drop the map flags
entirely, unless I missed something here. Unfortunately you'd still need
the locking changes in zsmalloc to make zram reads fully preemptible.

I am not suggesting that we have to go this way, just throwing out
ideas.

BTW, are we positive that the locking changes made in this series are
not introducing regressions? I'd hate for us to avoid an extra copy but
end up paying for it in lock contention.
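[Editorial sketch] The read_start/read_end pseudocode above can be modelled in userspace C roughly as follows. This is only a sketch under stated assumptions: `struct two_pages`, `obj_read_start()`, `obj_read_end()` and `PAGE_SZ` are hypothetical names, two separate arrays stand in for non-contiguous physical pages, and returning a pointer into the page stands in for kmap_local. It shows why no per-mapping state is needed: whether an unmap is due can be recomputed from the offset and size alone.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096u	/* assumed page size for this model */

/* Two "physical pages" that are not adjacent in memory. */
struct two_pages {
	unsigned char *page[2];
};

/*
 * Return a linear pointer to the object: straight into the first page
 * when the object fits there, otherwise into buf after copying both
 * halves. buf is only written in the spanning case.
 */
static const void *obj_read_start(const struct two_pages *p, size_t off,
				  size_t size, void *buf)
{
	size_t first;

	if (off + size <= PAGE_SZ)
		return p->page[0] + off;	/* stands in for the kmapped object */

	first = PAGE_SZ - off;
	memcpy(buf, p->page[0] + off, first);
	memcpy((unsigned char *)buf + first, p->page[1], size - first);
	return buf;
}

/*
 * Returns 1 when a real implementation would have to kunmap here,
 * 0 when the data was copied and there is nothing to undo.
 */
static int obj_read_end(size_t off, size_t size)
{
	return off + size <= PAGE_SZ;
}
```

In this model the caller always passes a scratch buffer big enough for the object and treats the returned pointer as read-only. A write path would still need some way to copy the buffer back for spanning objects.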
On (25/01/28 01:36), Yosry Ahmed wrote: > > Yes, for (a) mentioned above. > > > > > I guess in the WO case the buffer is not needed and we can just pass > > > NULL? > > > > Yes. > > Perhaps we want to document this and enforce it (make sure that the > NULL-ness of the buffer matches the access type). Right. > > But, and it's a big but. And it's (b) from the above. I wasn't brave > > enough to just drop (b) optimization and replace it with memcpy(), > > especially when we work with relatively large objects (say size-class > > 3600 bytes and above). This certainly would not make battery powered > > devices happier. Maybe in zswap the page is only read once (is that > > correct?), but in zram page can be read multiple times (e.g. when zram > > is used as a raw block-dev, or has a mounted fs on it) which means > > multiple extra memcpy()-s. > > In zswap, because we use the crypto_acomp API, when we cannot sleep with > the object mapped (which is true for zsmalloc), we just copy the > compressed object into a preallocated buffer anyway. So having a > zs_obj_load() interface would move that copy inside zsmalloc. Yeah, I saw zpool_can_sleep_mapped() and had the same thought. zram, as of now, doesn't support algos that can/need schedule internally for whatever reason - kmalloc, mutex, H/W wait, etc. > With your series, zswap can drop the memcpy and save some cycles on the > compress side. I didn't realize that zram does not perform any copies on the > read/decompress side. > > Maybe the load interface can still provide a buffer to avoid the copy > where possible? I suspect with that we don't need the state and can > just pass a pointer. We'd need another call to potentially unmap, so > maybe load_start/load_end, or read_start/read_end. 
>
> Something like:
>
> zs_obj_read_start(.., buf)
> {
> 	if (contained in one page)
> 		return kmapped obj
> 	else
> 		memcpy to buf
> 		return buf
> }
>
> zs_obj_read_end(.., buf)
> {
> 	if (contained in one page)
> 		kunmap
> }
>
> The interface is more straightforward and we can drop the map flags
> entirely, unless I missed something here. Unfortunately you'd still need
> the locking changes in zsmalloc to make zram reads fully preemptible.

Agreed, the interface part is less of a problem, the atomicity of zsmalloc
is a much bigger issue. We, technically, only need to mark zspage as "being
used, don't free" so that zsfree/compaction/migration don't mess with it,
but this is only "technically". In practice we then have

	CPU0				CPU1

	zs_map_object
	set READ bit			migrate
	schedule			pool rwlock
					size class spin-lock
					wait for READ bit to clear
	...				set WRITE bit
	clear READ bit

and the whole thing collapses like a house of cards. I wasn't able
to trigger a watchdog on my tests, but the pattern is there and it's
enough. Maybe we can teach compaction and migration to try-WRITE and
bail out if the page is locked, but I don't know.

> I am not suggesting that we have to go this way, just throwing out
> ideas.

Sure, load+store is still an option. While that zs_map_object()
optimization is nice, it may have two sides [in zram case]. On one
hand, we save memcpy() [but only for certain objects], on the other
hand, we keep the page locked for the entire decompression duration,
which can be quite a while (e.g. when algorithm is configured with
a very high compression level):

	CPU0				CPU1

	zs_map_object
	read lock page rwlock		write lock page rwlock
					spin
	decompress()			... spin a lot
	read unlock page rwlock

Maybe copy-in is just an okay thing to do. Let me try to measure.

> BTW, are we positive that the locking changes made in this series are
> not introducing regressions?

Cannot claim that with confidence. Our workloads don't match, we don't
even use zsmalloc in the same way :) Here be dragons.
On (25/01/28 14:29), Sergey Senozhatsky wrote:
> Maybe copy-in is just an okay thing to do. Let me try to measure.
Naaah, not really okay. On our memory-pressure test (4GB device, 4
CPUs) that kmap_local thingy appears to save approx 6GB of memcpy().
CPY stats: 734954 1102903168 4926116 6566654656
There were 734954 cases when we memcpy() [object spans two pages] with
accumulated size of 1102903168 bytes, and 4926116 cases when we took
a shortcut via kmap_local and avoided memcpy(), with accumulated size
of 6566654656 bytes.
In both cases I counted only RO direction for map, and WO direction
for unmap.
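[Editorial sketch] The four numbers in the "CPY stats" line can be turned into per-object averages with a few lines of C. The struct and function names below are hypothetical and only the four reported counters come from the message above. The arithmetic is plain integer division, so results are truncated.

```c
#include <stdint.h>

/* Counters in the order they appear on the "CPY stats" line. */
struct cpy_stats {
	uint64_t copies;	/* maps that needed memcpy() (object spans two pages) */
	uint64_t copy_bytes;	/* bytes copied by those maps */
	uint64_t direct;	/* maps served via kmap_local, no memcpy() */
	uint64_t direct_bytes;	/* bytes that would otherwise have been copied */
};

static const struct cpy_stats reported = {
	734954ULL, 1102903168ULL, 4926116ULL, 6566654656ULL
};

/* Average object size in bytes (truncated); 0 if there were no events. */
static uint64_t avg_size(uint64_t bytes, uint64_t count)
{
	return count ? bytes / count : 0;
}

/* Share of all maps that avoided the copy, in whole percent. */
static uint64_t direct_percent(const struct cpy_stats *s)
{
	uint64_t total = s->copies + s->direct;

	return total ? s->direct * 100 / total : 0;
}
```

With the reported counters this gives an average of 1500 bytes per copied object, 1333 bytes per directly mapped object, and 87% of maps taking the no-copy path. The avoided 6566654656 bytes are about 6.1 GiB, in line with the "approx 6GB" figure quoted above.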
On (25/01/28 14:29), Sergey Senozhatsky wrote:
> Maybe we can teach compaction and migration to try-WRITE and
> bail out if the page is locked, but I don't know.

This seems to be working just fine.
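[Editorial sketch] The thread does not show how try-WRITE was implemented, so the following is only one possible userspace model of the idea, written with C11 atomics. All names (`struct page_lock`, `page_write_trylock()`, `try_migrate()`) are hypothetical. Readers bump a counter, and the migration side never waits: it takes the lock only if nobody holds it and otherwise bails out, which avoids the "wait for READ bit to clear" step from the earlier diagram.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* state: 0 = unlocked, >0 = number of readers, -1 = one writer */
struct page_lock {
	atomic_int state;
};

static void page_read_lock(struct page_lock *l)
{
	int old = atomic_load(&l->state);

	for (;;) {
		if (old < 0) {
			/* a writer holds the lock: reload and retry */
			old = atomic_load(&l->state);
			continue;
		}
		/* on failure the CAS refreshes old with the current value */
		if (atomic_compare_exchange_weak(&l->state, &old, old + 1))
			return;
	}
}

static void page_read_unlock(struct page_lock *l)
{
	atomic_fetch_sub(&l->state, 1);
}

/* Succeeds only when there are no readers and no writer. */
static bool page_write_trylock(struct page_lock *l)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&l->state, &expected, -1);
}

static void page_write_unlock(struct page_lock *l)
{
	atomic_store(&l->state, 0);
}

/* Hypothetical migration step: skip the page if anyone is using it. */
static bool try_migrate(struct page_lock *l)
{
	if (!page_write_trylock(l))
		return false;	/* busy: bail out, the caller may retry later */
	/* ... the object would be moved here ... */
	page_write_unlock(l);
	return true;
}
```

In this model a reader may hold the lock across an arbitrarily long decompression without stalling migration. The trade-off is that a busy page is simply skipped for that pass.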
On Tue, Jan 28, 2025 at 06:38:35PM +0900, Sergey Senozhatsky wrote:
> On (25/01/28 14:29), Sergey Senozhatsky wrote:
> > Maybe copy-in is just an okay thing to do. Let me try to measure.
>
> Naaah, not really okay. On our memory-pressure test (4GB device, 4
> CPUs) that kmap_local thingy appears to save approx 6GB of memcpy().
>
> CPY stats: 734954 1102903168 4926116 6566654656
>
> There were 734954 cases when we memcpy() [object spans two pages] with
> accumulated size of 1102903168 bytes, and 4926116 cases when we took
> a shortcut via kmap_local and avoided memcpy(), with accumulated size
> of 6566654656 bytes.
>
> In both cases I counted only RO direction for map, and WO direction
> for unmap.

Yeah seems like the optimization is effective, at least on that
workload, unless the memcpy() is cheap and avoiding it is not buying as
much (do you know if that's the case?).

Anyway, we can keep the optimization and zswap could start making use of
it if zsmalloc becomes preemptible, so that's still a win.
On Tue, Jan 28, 2025 at 08:10:10PM +0900, Sergey Senozhatsky wrote:
> On (25/01/28 14:29), Sergey Senozhatsky wrote:
> > Maybe we can teach compaction and migration to try-WRITE and
> > bail out if the page is locked, but I don't know.
>
> This seems to be working just fine.

Does this mean we won't need as many locking changes to get zsmalloc to
be preemptible? I am slightly worried about how these changes will
affect performance tbh.
On (25/01/28 17:22), Yosry Ahmed wrote:
> On Tue, Jan 28, 2025 at 08:10:10PM +0900, Sergey Senozhatsky wrote:
> > On (25/01/28 14:29), Sergey Senozhatsky wrote:
> > > Maybe we can teach compaction and migration to try-WRITE and
> > > bail out if the page is locked, but I don't know.
> >
> > This seems to be working just fine.
>
> Does this mean we won't need as many locking changes to get zsmalloc to
> be preemptible?

Correct, only zspage lock is getting converted.
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index a48cd0ffe57d..72d84537dd38 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -58,4 +58,33 @@ unsigned long zs_compact(struct zs_pool *pool);
 unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size);
 
 void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
+
+struct zs_handle_mapping {
+	unsigned long handle;
+	/* Points to start of the object data either within local_copy or
+	 * within local_mapping. This is what callers should use to access
+	 * or modify handle data.
+	 */
+	void *handle_mem;
+
+	enum zs_mapmode mode;
+	union {
+		/*
+		 * Handle object data copied, because it spans across several
+		 * (non-contiguous) physical pages. This pointer should be
+		 * set by the zs_map_handle() caller beforehand and should
+		 * never be accessed directly.
+		 */
+		void *local_copy;
+		/*
+		 * Handle object mapped directly. Should never be used
+		 * directly.
+		 */
+		void *local_mapping;
+	};
+};
+
+int zs_map_handle(struct zs_pool *pool, struct zs_handle_mapping *map);
+void zs_unmap_handle(struct zs_pool *pool, struct zs_handle_mapping *map);
+
 #endif
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index a5c1f9852072..281bba4a3277 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1132,18 +1132,14 @@ static inline void __zs_cpu_down(struct mapping_area *area)
 	area->vm_buf = NULL;
 }
 
-static void *__zs_map_object(struct mapping_area *area,
-			struct zpdesc *zpdescs[2], int off, int size)
+static void zs_obj_copyin(void *buf, struct zpdesc *zpdesc, int off, int size)
 {
+	struct zpdesc *zpdescs[2];
 	size_t sizes[2];
-	char *buf = area->vm_buf;
-
-	/* disable page faults to match kmap_local_page() return conditions */
-	pagefault_disable();
 
-	/* no read fastpath */
-	if (area->vm_mm == ZS_MM_WO)
-		goto out;
+	zpdescs[0] = zpdesc;
+	zpdescs[1] = get_next_zpdesc(zpdesc);
+	BUG_ON(!zpdescs[1]);
 
 	sizes[0] = PAGE_SIZE - off;
 	sizes[1] = size - sizes[0];
@@ -1151,21 +1147,17 @@ static void *__zs_map_object(struct mapping_area *area,
 	/* copy object to per-cpu buffer */
 	memcpy_from_page(buf, zpdesc_page(zpdescs[0]), off, sizes[0]);
 	memcpy_from_page(buf + sizes[0], zpdesc_page(zpdescs[1]), 0, sizes[1]);
-out:
-	return area->vm_buf;
 }
 
-static void __zs_unmap_object(struct mapping_area *area,
-			struct zpdesc *zpdescs[2], int off, int size)
+static void zs_obj_copyout(void *buf, struct zpdesc *zpdesc, int off, int size)
 {
+	struct zpdesc *zpdescs[2];
 	size_t sizes[2];
-	char *buf;
 
-	/* no write fastpath */
-	if (area->vm_mm == ZS_MM_RO)
-		goto out;
+	zpdescs[0] = zpdesc;
+	zpdescs[1] = get_next_zpdesc(zpdesc);
+	BUG_ON(!zpdescs[1]);
 
-	buf = area->vm_buf;
 	buf = buf + ZS_HANDLE_SIZE;
 	size -= ZS_HANDLE_SIZE;
 	off += ZS_HANDLE_SIZE;
@@ -1176,10 +1168,6 @@ static void __zs_unmap_object(struct mapping_area *area,
 	/* copy per-cpu buffer to object */
 	memcpy_to_page(zpdesc_page(zpdescs[0]), off, buf, sizes[0]);
 	memcpy_to_page(zpdesc_page(zpdescs[1]), 0, buf + sizes[0], sizes[1]);
-
-out:
-	/* enable page faults to match kunmap_local() return conditions */
-	pagefault_enable();
 }
 
 static int zs_cpu_prepare(unsigned int cpu)
@@ -1260,6 +1248,8 @@ EXPORT_SYMBOL_GPL(zs_get_total_pages);
  * against nested mappings.
  *
  * This function returns with preemption and page faults disabled.
+ *
+ * NOTE: this function is deprecated and will be removed.
  */
 void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 			enum zs_mapmode mm)
@@ -1268,10 +1258,8 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 	struct zpdesc *zpdesc;
 	unsigned long obj, off;
 	unsigned int obj_idx;
-
 	struct size_class *class;
 	struct mapping_area *area;
-	struct zpdesc *zpdescs[2];
 	void *ret;
 
 	/*
@@ -1309,12 +1297,14 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 		goto out;
 	}
 
-	/* this object spans two pages */
-	zpdescs[0] = zpdesc;
-	zpdescs[1] = get_next_zpdesc(zpdesc);
-	BUG_ON(!zpdescs[1]);
+	ret = area->vm_buf;
+	/* disable page faults to match kmap_local_page() return conditions */
+	pagefault_disable();
+	if (mm != ZS_MM_WO) {
+		/* this object spans two pages */
+		zs_obj_copyin(area->vm_buf, zpdesc, off, class->size);
+	}
 
-	ret = __zs_map_object(area, zpdescs, off, class->size);
 out:
 	if (likely(!ZsHugePage(zspage)))
 		ret += ZS_HANDLE_SIZE;
@@ -1323,13 +1313,13 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 }
 EXPORT_SYMBOL_GPL(zs_map_object);
 
+/* NOTE: this function is deprecated and will be removed. */
 void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 {
 	struct zspage *zspage;
 	struct zpdesc *zpdesc;
 	unsigned long obj, off;
 	unsigned int obj_idx;
-
 	struct size_class *class;
 	struct mapping_area *area;
 
@@ -1340,23 +1330,103 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 	off = offset_in_page(class->size * obj_idx);
 
 	area = this_cpu_ptr(&zs_map_area);
-	if (off + class->size <= PAGE_SIZE)
+	if (off + class->size <= PAGE_SIZE) {
 		kunmap_local(area->vm_addr);
-	else {
-		struct zpdesc *zpdescs[2];
+		goto out;
+	}
 
-		zpdescs[0] = zpdesc;
-		zpdescs[1] = get_next_zpdesc(zpdesc);
-		BUG_ON(!zpdescs[1]);
+	if (area->vm_mm != ZS_MM_RO)
+		zs_obj_copyout(area->vm_buf, zpdesc, off, class->size);
+	/* enable page faults to match kunmap_local() return conditions */
+	pagefault_enable();
 
-		__zs_unmap_object(area, zpdescs, off, class->size);
-	}
+out:
 	local_unlock(&zs_map_area.lock);
-
 	zspage_read_unlock(zspage);
 }
 EXPORT_SYMBOL_GPL(zs_unmap_object);
 
+void zs_unmap_handle(struct zs_pool *pool, struct zs_handle_mapping *map)
+{
+	struct zspage *zspage;
+	struct zpdesc *zpdesc;
+	unsigned long obj, off;
+	unsigned int obj_idx;
+	struct size_class *class;
+
+	obj = handle_to_obj(map->handle);
+	obj_to_location(obj, &zpdesc, &obj_idx);
+	zspage = get_zspage(zpdesc);
+	class = zspage_class(pool, zspage);
+	off = offset_in_page(class->size * obj_idx);
+
+	if (off + class->size <= PAGE_SIZE) {
+		kunmap_local(map->local_mapping);
+		goto out;
+	}
+
+	if (map->mode != ZS_MM_RO)
+		zs_obj_copyout(map->local_copy, zpdesc, off, class->size);
+
+out:
+	zspage_read_unlock(zspage);
+}
+EXPORT_SYMBOL_GPL(zs_unmap_handle);
+
+int zs_map_handle(struct zs_pool *pool, struct zs_handle_mapping *map)
+{
+	struct zspage *zspage;
+	struct zpdesc *zpdesc;
+	unsigned long obj, off;
+	unsigned int obj_idx;
+	struct size_class *class;
+
+	WARN_ON(in_interrupt());
+
+	/* It guarantees it can get zspage from handle safely */
+	pool_read_lock(pool);
+	obj = handle_to_obj(map->handle);
+	obj_to_location(obj, &zpdesc, &obj_idx);
+	zspage = get_zspage(zpdesc);
+
+	/*
+	 * migration cannot move any zpages in this zspage. Here, class->lock
+	 * is too heavy since callers would take some time until they calls
+	 * zs_unmap_object API so delegate the locking from class to zspage
+	 * which is smaller granularity.
+	 */
+	zspage_read_lock(zspage);
+	pool_read_unlock(pool);
+
+	class = zspage_class(pool, zspage);
+	off = offset_in_page(class->size * obj_idx);
+
+	if (off + class->size <= PAGE_SIZE) {
+		/* this object is contained entirely within a page */
+		map->local_mapping = kmap_local_zpdesc(zpdesc);
+		map->handle_mem = map->local_mapping + off;
+		goto out;
+	}
+
+	if (WARN_ON_ONCE(!map->local_copy)) {
+		zspage_read_unlock(zspage);
+		return -EINVAL;
+	}
+
+	map->handle_mem = map->local_copy;
+	if (map->mode != ZS_MM_WO) {
+		/* this object spans two pages */
+		zs_obj_copyin(map->local_copy, zpdesc, off, class->size);
+	}
+
+out:
+	if (likely(!ZsHugePage(zspage)))
+		map->handle_mem += ZS_HANDLE_SIZE;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(zs_map_handle);
+
 /**
  * zs_huge_class_size() - Returns the size (in bytes) of the first huge
  *                        zsmalloc &size_class.
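The split-copy arithmetic in zs_obj_copyin()/zs_obj_copyout() is easy to misread in diff form, so here is a userspace sketch of the same idea. It is an illustration only: plain byte arrays stand in for the two zpdescs, `MODEL_PAGE_SIZE` and the `model_*` names are invented, and the ZS_HANDLE_SIZE adjustment that the patch applies in the copy-out path is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u

/*
 * Gather an object that starts at offset `off` in page0 and continues at
 * offset 0 of page1 into one contiguous caller-supplied buffer.
 */
static void model_copyin(void *buf, const unsigned char *page0,
			 const unsigned char *page1, size_t off, size_t size)
{
	size_t sizes[2];

	/* Only objects that really cross the page boundary get here. */
	assert(off < MODEL_PAGE_SIZE && off + size > MODEL_PAGE_SIZE);

	sizes[0] = MODEL_PAGE_SIZE - off;	/* tail of the first page */
	sizes[1] = size - sizes[0];		/* head of the second page */
	assert(sizes[1] <= MODEL_PAGE_SIZE);

	memcpy(buf, page0 + off, sizes[0]);
	memcpy((unsigned char *)buf + sizes[0], page1, sizes[1]);
}

/* Scatter the contiguous buffer back over the two pages. */
static void model_copyout(const void *buf, unsigned char *page0,
			  unsigned char *page1, size_t off, size_t size)
{
	size_t sizes[2];

	assert(off < MODEL_PAGE_SIZE && off + size > MODEL_PAGE_SIZE);

	sizes[0] = MODEL_PAGE_SIZE - off;
	sizes[1] = size - sizes[0];
	assert(sizes[1] <= MODEL_PAGE_SIZE);

	memcpy(page0 + off, buf, sizes[0]);
	memcpy(page1, (const unsigned char *)buf + sizes[0], sizes[1]);
}
```

For example, a 200-byte object at offset 4000 is split into 96 bytes from the end of the first page and 104 bytes from the start of the second.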
Introduce a new API to map/unmap a zsmalloc handle/object. The key
difference is that this API does not impose atomicity restrictions
on its users, unlike zs_map_object() which returns with page-faults
and preemption disabled - the handle mapping API does not need a
per-CPU vm-area because the users are required to provide an aux
buffer for objects that span several physical pages.

Keep zs_map_object/zs_unmap_object for the time being, as there are
still users of them, but the old API will eventually be removed.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 include/linux/zsmalloc.h |  29 ++++++++
 mm/zsmalloc.c            | 148 ++++++++++++++++++++++++++++-----------
 2 files changed, 138 insertions(+), 39 deletions(-)
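The caller-side contract described in the commit message (the aux buffer is only needed, and only touched, for objects that span pages) can be modelled in userspace as follows. This is a sketch under stated assumptions, not the kernel API: `model_mapping` mirrors the field names of struct zs_handle_mapping, but an offset/size pair stands in for the handle, the `local_mapping` union member is dropped, and all `model_*`/`MODEL_*` names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u

enum model_mapmode { MODEL_MM_RW, MODEL_MM_RO, MODEL_MM_WO };

struct model_pages {
	unsigned char *page[2];	/* two non-contiguous "physical" pages */
};

struct model_mapping {
	size_t off;		/* stands in for the handle: offset in page[0] */
	size_t size;		/* object size */
	void *handle_mem;	/* what the caller reads or writes */
	enum model_mapmode mode;
	void *local_copy;	/* caller-provided aux buffer */
};

static int model_map_handle(struct model_pages *pg, struct model_mapping *map)
{
	size_t first;

	assert(map->off < MODEL_PAGE_SIZE && map->size <= MODEL_PAGE_SIZE);

	if (map->off + map->size <= MODEL_PAGE_SIZE) {
		/* Object sits in one page: hand out a direct pointer. */
		map->handle_mem = pg->page[0] + map->off;
		return 0;
	}

	/* Spanning object: the caller must have supplied a buffer. */
	if (!map->local_copy)
		return -EINVAL;

	map->handle_mem = map->local_copy;
	if (map->mode != MODEL_MM_WO) {
		/* Readers need the current contents gathered first. */
		first = MODEL_PAGE_SIZE - map->off;
		memcpy(map->local_copy, pg->page[0] + map->off, first);
		memcpy((unsigned char *)map->local_copy + first,
		       pg->page[1], map->size - first);
	}
	return 0;
}

static void model_unmap_handle(struct model_pages *pg, struct model_mapping *map)
{
	size_t first;

	if (map->off + map->size <= MODEL_PAGE_SIZE)
		return;		/* direct mapping, nothing to write back */

	if (map->mode != MODEL_MM_RO) {
		/* Writers scatter the buffer back over both pages. */
		first = MODEL_PAGE_SIZE - map->off;
		memcpy(pg->page[0] + map->off, map->local_copy, first);
		memcpy(pg->page[1], (unsigned char *)map->local_copy + first,
		       map->size - first);
	}
}
```

A caller in this model fills in `off`, `size`, `mode` and `local_copy`, calls model_map_handle(), accesses the object only through `handle_mem`, and then calls model_unmap_handle(). Passing a NULL `local_copy` is fine as long as the object happens to fit in one page, which is why a caller that cannot know this in advance has to provide the buffer every time.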