diff mbox series

tmpfs: use ida to get inode number

Message ID 1574259798-144561-1-git-send-email-zhengbin13@huawei.com (mailing list archive)
State New, archived
Headers show
Series tmpfs: use ida to get inode number | expand

Commit Message

Zheng Bin Nov. 20, 2019, 2:23 p.m. UTC
Use a script to test tmpfs, after 10 days, there will be files share the
same inode number, thus bug happens.
The script is as follows:
while(1) {
  create a file
  dlopen it
  ...
  remove it
}

I have tried to change last_ino type to unsigned long, while this was
rejected, see details on https://patchwork.kernel.org/patch/11023915.

Use ida to get inode number, from the fs_mark test, performance impact
is small.

CPU core: 128  memory: 8G
Use fs_mark to create ten million files first in tmpfs.

Then test performance in the following command(test five times):
rm -rf /tmp/fsmark_test
sync
echo 3 > /proc/sys/vm/drop_caches
sleep 10
fs_mark -t 20 -s 0 -n 102400 -D 64 -N 1600 -L 3 -d /tmp/fsmark_test

result is file creation speed(Files/sec).
get_next_ino  |  use ida
439506.7      |  423219.0
441453.0      |  406832.7
439868.0      |  441283.3
445642.3      |  428221.3
441776.7      |  438129.0
average:
441649.34     |  427537.06

Signed-off-by: zhengbin <zhengbin13@huawei.com>
---
 mm/shmem.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

--
2.7.4

Comments

Matthew Wilcox (Oracle) Nov. 20, 2019, 3:45 p.m. UTC | #1
On Wed, Nov 20, 2019 at 10:23:18PM +0800, zhengbin wrote:
> I have tried to change last_ino type to unsigned long, while this was
> rejected, see details on https://patchwork.kernel.org/patch/11023915.

Did you end up trying sbitmap?

What I think is fundamentally wrong with this patch is that you've found a
problem in get_next_ino() and decided to use a different scheme for this
one filesystem, leaving every other filesystem which uses get_next_ino()
facing the same problem.

That could be acceptable if you explained why tmpfs is fundamentally
different from all the other filesystems that use get_next_ino(), but
you haven't (and I don't think there is such a difference.  eg pipes,
autofs and ipc mqueue could all have the same problem.

There are some other problems I noticed, but they're not worth bringing
up until this fundamental design choice is justified.
Zheng Bin Nov. 21, 2019, 2:36 a.m. UTC | #2
On 2019/11/20 23:45, Matthew Wilcox wrote:
> On Wed, Nov 20, 2019 at 10:23:18PM +0800, zhengbin wrote:
>> I have tried to change last_ino type to unsigned long, while this was
>> rejected, see details on https://patchwork.kernel.org/patch/11023915.
> Did you end up trying sbitmap?

Maybe sbitmap is not a good solution, max_inodes of tmpfs are controlled by mount options--nrinodes,

which can be modified by remountfs(bigger or smaller), as the comment of function sbitmap_resize says:

 * Doesn't reallocate anything. It's up to the caller to ensure that the new
 * depth doesn't exceed the depth that the sb was initialized with.

We can modify this to meet the growing requirements, there will still be questions as follows:

1. tmpfs is a ram filesystem, we need to allocate sbitmap memory for sbinfo->max_inodes(while this maybe huge)

2.If remountfs changes  max_inode, we have to deal with it, while this may take a long time

(bigger: we need to free the old sbitmap memory, allocate new memory, copy the old sbitmap to new sbitmap

smaller: How do we deal with it?ie: we use sb->map[inode number/8] to find the sbitmap, we need to change the exist

inode numbers?while this maybe used by userspace application.)

>
> What I think is fundamentally wrong with this patch is that you've found a
> problem in get_next_ino() and decided to use a different scheme for this
> one filesystem, leaving every other filesystem which uses get_next_ino()
> facing the same problem.
>
> That could be acceptable if you explained why tmpfs is fundamentally
> different from all the other filesystems that use get_next_ino(), but
> you haven't (and I don't think there is such a difference.  eg pipes,
> autofs and ipc mqueue could all have the same problem.

tmpfs is same with all the other filesystems that use get_next_ino(), but we need to solve this problem one by one.

If tmpfs is ok, we can modify the other filesystems too. Besides, I do not  recommend all file systems share the same

global variable, for performance impact consideration.

>
> There are some other problems I noticed, but they're not worth bringing
> up until this fundamental design choice is justified.
Agree, thanks.
>
Al Viro Nov. 21, 2019, 4:31 a.m. UTC | #3
On Wed, Nov 20, 2019 at 07:45:52AM -0800, Matthew Wilcox wrote:
> On Wed, Nov 20, 2019 at 10:23:18PM +0800, zhengbin wrote:
> > I have tried to change last_ino type to unsigned long, while this was
> > rejected, see details on https://patchwork.kernel.org/patch/11023915.
> 
> Did you end up trying sbitmap?
> 
> What I think is fundamentally wrong with this patch is that you've found a
> problem in get_next_ino() and decided to use a different scheme for this
> one filesystem, leaving every other filesystem which uses get_next_ino()
> facing the same problem.
> 
> That could be acceptable if you explained why tmpfs is fundamentally
> different from all the other filesystems that use get_next_ino(), but
> you haven't (and I don't think there is such a difference.  eg pipes,
> autofs and ipc mqueue could all have the same problem.

If you think that anyone is willing to pay one hell of a price on each
pipe(2)...  Note that get_next_ino() is pretty careful about staying
within per-cpu stuff most of the time; it hits any cross-CPU traffic
only in 1/1024th of calls.  This, AFAICS, dirties shared cachelines
on each call.  And there's a plenty of pipe-heavy workloads, for obvious
reasons.
Hugh Dickins Nov. 21, 2019, 4:52 a.m. UTC | #4
On Thu, 21 Nov 2019, zhengbin (A) wrote:
> On 2019/11/20 23:45, Matthew Wilcox wrote:
> > On Wed, Nov 20, 2019 at 10:23:18PM +0800, zhengbin wrote:
> >> I have tried to change last_ino type to unsigned long, while this was
> >> rejected, see details on https://patchwork.kernel.org/patch/11023915.
> > Did you end up trying sbitmap?
> 
> Maybe sbitmap is not a good solution, max_inodes of tmpfs are controlled by mount options--nrinodes,
> 
> which can be modified by remountfs(bigger or smaller), as the comment of function sbitmap_resize says:
> 
>  * Doesn't reallocate anything. It's up to the caller to ensure that the new
>  * depth doesn't exceed the depth that the sb was initialized with.
> 
> We can modify this to meet the growing requirements, there will still be questions as follows:
> 
> 1. tmpfs is a ram filesystem, we need to allocate sbitmap memory for sbinfo->max_inodes(while this maybe huge)
> 
> 2.If remountfs changes  max_inode, we have to deal with it, while this may take a long time
> 
> (bigger: we need to free the old sbitmap memory, allocate new memory, copy the old sbitmap to new sbitmap
> 
> smaller: How do we deal with it?ie: we use sb->map[inode number/8] to find the sbitmap, we need to change the exist
> 
> inode numbers?while this maybe used by userspace application.)
> 
> >
> > What I think is fundamentally wrong with this patch is that you've found a
> > problem in get_next_ino() and decided to use a different scheme for this
> > one filesystem, leaving every other filesystem which uses get_next_ino()
> > facing the same problem.
> >
> > That could be acceptable if you explained why tmpfs is fundamentally
> > different from all the other filesystems that use get_next_ino(), but
> > you haven't (and I don't think there is such a difference.  eg pipes,
> > autofs and ipc mqueue could all have the same problem.
> 
> tmpfs is same with all the other filesystems that use get_next_ino(), but we need to solve this problem one by one.
> 
> If tmpfs is ok, we can modify the other filesystems too. Besides, I do not  recommend all file systems share the same
> 
> global variable, for performance impact consideration.
> 
> >
> > There are some other problems I noticed, but they're not worth bringing
> > up until this fundamental design choice is justified.
> Agree, thanks.

Just a rushed FYI without looking at your patch or comments.

Internally (in Google) we do rely on good tmpfs inode numbers more
than on those of other get_next_ino() filesystems, and carry a patch
to mm/shmem.c for it to use 64-bit inode numbers (and separate inode
number space for each superblock) - essentially,

	ino = sbinfo->next_ino++;
	/* Avoid 0 in the low 32 bits: might appear deleted */
	if (unlikely((unsigned int)ino == 0))
		ino = sbinfo->next_ino++;

Which I think would be faster, and need less memory, than IDA.
But whether that is of general interest, or of interest to you,
depends upon how prevalent 32-bit executables built without
__FILE_OFFSET_BITS=64 still are these days.

Hugh
Zheng Bin Nov. 21, 2019, 6:45 a.m. UTC | #5
On 2019/11/21 12:52, Hugh Dickins wrote:
> On Thu, 21 Nov 2019, zhengbin (A) wrote:
>> On 2019/11/20 23:45, Matthew Wilcox wrote:
>>> On Wed, Nov 20, 2019 at 10:23:18PM +0800, zhengbin wrote:
>>>> I have tried to change last_ino type to unsigned long, while this was
>>>> rejected, see details on https://patchwork.kernel.org/patch/11023915.
>>> Did you end up trying sbitmap?
>> Maybe sbitmap is not a good solution, max_inodes of tmpfs are controlled by mount options--nrinodes,
>>
>> which can be modified by remountfs(bigger or smaller), as the comment of function sbitmap_resize says:
>>
>>  * Doesn't reallocate anything. It's up to the caller to ensure that the new
>>  * depth doesn't exceed the depth that the sb was initialized with.
>>
>> We can modify this to meet the growing requirements, there will still be questions as follows:
>>
>> 1. tmpfs is a ram filesystem, we need to allocate sbitmap memory for sbinfo->max_inodes(while this maybe huge)
>>
>> 2.If remountfs changes  max_inode, we have to deal with it, while this may take a long time
>>
>> (bigger: we need to free the old sbitmap memory, allocate new memory, copy the old sbitmap to new sbitmap
>>
>> smaller: How do we deal with it?ie: we use sb->map[inode number/8] to find the sbitmap, we need to change the exist
>>
>> inode numbers?while this maybe used by userspace application.)
>>
>>> What I think is fundamentally wrong with this patch is that you've found a
>>> problem in get_next_ino() and decided to use a different scheme for this
>>> one filesystem, leaving every other filesystem which uses get_next_ino()
>>> facing the same problem.
>>>
>>> That could be acceptable if you explained why tmpfs is fundamentally
>>> different from all the other filesystems that use get_next_ino(), but
>>> you haven't (and I don't think there is such a difference.  eg pipes,
>>> autofs and ipc mqueue could all have the same problem.
>> tmpfs is same with all the other filesystems that use get_next_ino(), but we need to solve this problem one by one.
>>
>> If tmpfs is ok, we can modify the other filesystems too. Besides, I do not  recommend all file systems share the same
>>
>> global variable, for performance impact consideration.
>>
>>> There are some other problems I noticed, but they're not worth bringing
>>> up until this fundamental design choice is justified.
>> Agree, thanks.
> Just a rushed FYI without looking at your patch or comments.
>
> Internally (in Google) we do rely on good tmpfs inode numbers more
> than on those of other get_next_ino() filesystems, and carry a patch
> to mm/shmem.c for it to use 64-bit inode numbers (and separate inode
> number space for each superblock) - essentially,
>
> 	ino = sbinfo->next_ino++;
> 	/* Avoid 0 in the low 32 bits: might appear deleted */
> 	if (unlikely((unsigned int)ino == 0))
> 		ino = sbinfo->next_ino++;
>
> Which I think would be faster, and need less memory, than IDA.
> But whether that is of general interest, or of interest to you,
> depends upon how prevalent 32-bit executables built without
> __FILE_OFFSET_BITS=64 still are these days.

So how google think about this? inode number > 32-bit, but 32-bit executables

cat not handle this? "separate inode number space for each superblock" can reduce the

probability, but still can not solve it.

>
> Hugh
J. R. Okajima Nov. 21, 2019, 11:40 a.m. UTC | #6
Hugh Dickins:
> Internally (in Google) we do rely on good tmpfs inode numbers more
> than on those of other get_next_ino() filesystems, and carry a patch
> to mm/shmem.c for it to use 64-bit inode numbers (and separate inode
> number space for each superblock) - essentially,
>
> =09ino =3D sbinfo->next_ino++;
> =09/* Avoid 0 in the low 32 bits: might appear deleted */
> =09if (unlikely((unsigned int)ino =3D=3D 0))
> =09=09ino =3D sbinfo->next_ino++;

I agree with that "per superblock inum space", but I don't see your
point.  How can you manage it fully?  I mean how can you decide whether
the new inum is in use or not?
For example,
- you create a file which is assigned inum#10.
- you or other people create and unlink over and over on the same tmpfs.
- then sbinfo->next_ino will become zero, skipped, ok.
- and then it will be 10.
I don't think you want to share the same inum by two inodes.

Moreover, SysV SHM uses tmpfs and shmget(2) overwrite inum internally.
It will be another seed of a similar problem.


J. R. Okajima
Hugh Dickins Nov. 21, 2019, 7:53 p.m. UTC | #7
On Thu, 21 Nov 2019, zhengbin (A) wrote:
> On 2019/11/21 12:52, Hugh Dickins wrote:
> > Just a rushed FYI without looking at your patch or comments.
> >
> > Internally (in Google) we do rely on good tmpfs inode numbers more
> > than on those of other get_next_ino() filesystems, and carry a patch
> > to mm/shmem.c for it to use 64-bit inode numbers (and separate inode
> > number space for each superblock) - essentially,
> >
> > 	ino = sbinfo->next_ino++;
> > 	/* Avoid 0 in the low 32 bits: might appear deleted */
> > 	if (unlikely((unsigned int)ino == 0))
> > 		ino = sbinfo->next_ino++;
> >
> > Which I think would be faster, and need less memory, than IDA.
> > But whether that is of general interest, or of interest to you,
> > depends upon how prevalent 32-bit executables built without
> > __FILE_OFFSET_BITS=64 still are these days.
> 
> So how google think about this? inode number > 32-bit, but 32-bit executables
> cat not handle this?

Google is free to limit what executables are run on its machines,
and how they are built, so little problem here.

A general-purpose 32-bit Linux distribution does not have that freedom,
does not want to limit what the user runs.  But I thought that by now
they (and all serious users of 32-bit systems) were building their own
executables with _FILE_OFFSET_BITS=64 (I was too generous with the
underscores yesterday); and I thought that defined __USE_FILE_OFFSET64,
and that typedef'd ino_t to be __ino64_t.  And the 32-bit kernel would
have __ARCH_WANT_STAT64, which delivers st_ino as unsigned long long.

So I thought that a modern, professional 32-bit executable would be
dealing in 64-bit inode numbers anyway.  But I am not a system builder,
so perhaps I'm being naive.  And of course some users may have to support
some old userspace, or apps that assign inode numbers to "int" or "long"
or whatever.  I have no insight into the extent of that problem.

Hugh
Hugh Dickins Nov. 21, 2019, 8:07 p.m. UTC | #8
On Thu, 21 Nov 2019, J. R. Okajima wrote:
> Hugh Dickins:
> > Internally (in Google) we do rely on good tmpfs inode numbers more
> > than on those of other get_next_ino() filesystems, and carry a patch
> > to mm/shmem.c for it to use 64-bit inode numbers (and separate inode
> > number space for each superblock) - essentially,
> >
> > =09ino =3D sbinfo->next_ino++;
> > =09/* Avoid 0 in the low 32 bits: might appear deleted */
> > =09if (unlikely((unsigned int)ino =3D=3D 0))
> > =09=09ino =3D sbinfo->next_ino++;
> 
> I agree with that "per superblock inum space", but I don't see your
> point.  How can you manage it fully?  I mean how can you decide whether
> the new inum is in use or not?
> For example,
> - you create a file which is assigned inum#10.
> - you or other people create and unlink over and over on the same tmpfs.
> - then sbinfo->next_ino will become zero, skipped, ok.
> - and then it will be 10.
> I don't think you want to share the same inum by two inodes.

64 bits. I haven't done the arithmetic to work out the amusing number,
but zhengbin mentioned the script taking 10 days to duplicate an inode
number in 32 bits, so: a larger number of years than I need to care about.

> 
> Moreover, SysV SHM uses tmpfs and shmget(2) overwrite inum internally.
> It will be another seed of a similar problem.

I was totally ignorant of that peculiarity in ipc/shm.c, thanks for
alerting me to it.  But it doesn't affect what we're doing in tmpfs,
and apparently suits the users of SysV SHM: I don't see any need to
worry about it.

Hugh
Zheng Bin Nov. 22, 2019, 1:23 a.m. UTC | #9
On 2019/11/22 3:53, Hugh Dickins wrote:
> On Thu, 21 Nov 2019, zhengbin (A) wrote:
>> On 2019/11/21 12:52, Hugh Dickins wrote:
>>> Just a rushed FYI without looking at your patch or comments.
>>>
>>> Internally (in Google) we do rely on good tmpfs inode numbers more
>>> than on those of other get_next_ino() filesystems, and carry a patch
>>> to mm/shmem.c for it to use 64-bit inode numbers (and separate inode
>>> number space for each superblock) - essentially,
>>>
>>> 	ino = sbinfo->next_ino++;
>>> 	/* Avoid 0 in the low 32 bits: might appear deleted */
>>> 	if (unlikely((unsigned int)ino == 0))
>>> 		ino = sbinfo->next_ino++;
>>>
>>> Which I think would be faster, and need less memory, than IDA.
>>> But whether that is of general interest, or of interest to you,
>>> depends upon how prevalent 32-bit executables built without
>>> __FILE_OFFSET_BITS=64 still are these days.
>> So how google think about this? inode number > 32-bit, but 32-bit executables
>> cat not handle this?
> Google is free to limit what executables are run on its machines,
> and how they are built, so little problem here.
>
> A general-purpose 32-bit Linux distribution does not have that freedom,
> does not want to limit what the user runs.  But I thought that by now
> they (and all serious users of 32-bit systems) were building their own
> executables with _FILE_OFFSET_BITS=64 (I was too generous with the
> underscores yesterday); and I thought that defined __USE_FILE_OFFSET64,
> and that typedef'd ino_t to be __ino64_t.  And the 32-bit kernel would
> have __ARCH_WANT_STAT64, which delivers st_ino as unsigned long long.
>
> So I thought that a modern, professional 32-bit executable would be
> dealing in 64-bit inode numbers anyway.  But I am not a system builder,
> so perhaps I'm being naive.  And of course some users may have to support
> some old userspace, or apps that assign inode numbers to "int" or "long"
> or whatever.  I have no insight into the extent of that problem.

So how to solve this problem?

1. tmpfs use ida or other data structure

2. tmpfs use 64-bit, each superblock a inode number space

3. do not do anything, If somebody hits this bug, let them solve for themselves

4. (last_ino change to 64-bit)get_next_ino -->other filesystems will be ok, but it was rejected before

>
> Hugh
>
> .
>
Matthew Wilcox (Oracle) Nov. 22, 2019, 10:13 p.m. UTC | #10
On Fri, Nov 22, 2019 at 09:23:30AM +0800, zhengbin (A) wrote:
> On 2019/11/22 3:53, Hugh Dickins wrote:
> > On Thu, 21 Nov 2019, zhengbin (A) wrote:
> >> On 2019/11/21 12:52, Hugh Dickins wrote:
> >>> Just a rushed FYI without looking at your patch or comments.
> >>>
> >>> Internally (in Google) we do rely on good tmpfs inode numbers more
> >>> than on those of other get_next_ino() filesystems, and carry a patch
> >>> to mm/shmem.c for it to use 64-bit inode numbers (and separate inode
> >>> number space for each superblock) - essentially,
> >>>
> >>> 	ino = sbinfo->next_ino++;
> >>> 	/* Avoid 0 in the low 32 bits: might appear deleted */
> >>> 	if (unlikely((unsigned int)ino == 0))
> >>> 		ino = sbinfo->next_ino++;
> >>>
> >>> Which I think would be faster, and need less memory, than IDA.
> >>> But whether that is of general interest, or of interest to you,
> >>> depends upon how prevalent 32-bit executables built without
> >>> __FILE_OFFSET_BITS=64 still are these days.
> >> So how google think about this? inode number > 32-bit, but 32-bit executables
> >> cat not handle this?
> > Google is free to limit what executables are run on its machines,
> > and how they are built, so little problem here.
> >
> > A general-purpose 32-bit Linux distribution does not have that freedom,
> > does not want to limit what the user runs.  But I thought that by now
> > they (and all serious users of 32-bit systems) were building their own
> > executables with _FILE_OFFSET_BITS=64 (I was too generous with the
> > underscores yesterday); and I thought that defined __USE_FILE_OFFSET64,
> > and that typedef'd ino_t to be __ino64_t.  And the 32-bit kernel would
> > have __ARCH_WANT_STAT64, which delivers st_ino as unsigned long long.
> >
> > So I thought that a modern, professional 32-bit executable would be
> > dealing in 64-bit inode numbers anyway.  But I am not a system builder,
> > so perhaps I'm being naive.  And of course some users may have to support
> > some old userspace, or apps that assign inode numbers to "int" or "long"
> > or whatever.  I have no insight into the extent of that problem.
> 
> So how to solve this problem?
> 
> 1. tmpfs use ida or other data structure
> 
> 2. tmpfs use 64-bit, each superblock a inode number space
> 
> 3. do not do anything, If somebody hits this bug, let them solve for themselves
> 
> 4. (last_ino change to 64-bit)get_next_ino -->other filesystems will be ok, but it was rejected before

5. Extend the sbitmap API to allow for growing the bitmap.  I had a
look at doing that, and it looks hard.  There are a lot of things which
are set up at initialisation and changing them mid-use seems tricky.
Ccing Jens in case he has an opinion.

6. Creating a percpu IDA.  This doesn't seem too hard.  We need a percpu
pointer to an IDA leaf (128 bytes), and a percpu integer which is the
current base for this CPU.  At allocation time, find and set the first
free bit in the leaf, and add on the current base.

If the percpu leaf is full, set the XA_MARK_1 bit on the entry in
the XArray.  Then look for any leaves which have both the XA_MARK_0
and XA_MARK_1 bits set; if there is one, claim it by clearing the
XA_MARK_1 bit.  If not, kzalloc a new one and find a free spot for it
in the underlying XArray.

Freeing an ID is simply ida_free().  That will involve changing the
users of get_next_ino() to call put_ino(), or something.

This should generally result in similar contention between threads as
the current scheme -- accessing a shared resource every 1024 allocations.
Maybe more often as we try to avoid leaving gaps in the data structure,
or maybe less as we reuse IDs.

(I've tried to explain what I want here, but appreciate it may be
inscrutable.  I can try to explain more, or maybe I should just write
the code myself)
Zheng Bin Nov. 23, 2019, 2:16 a.m. UTC | #11
On 2019/11/23 6:13, Matthew Wilcox wrote:
> On Fri, Nov 22, 2019 at 09:23:30AM +0800, zhengbin (A) wrote:
>> On 2019/11/22 3:53, Hugh Dickins wrote:
>>> On Thu, 21 Nov 2019, zhengbin (A) wrote:
>>>> On 2019/11/21 12:52, Hugh Dickins wrote:
>>>>> Just a rushed FYI without looking at your patch or comments.
>>>>>
>>>>> Internally (in Google) we do rely on good tmpfs inode numbers more
>>>>> than on those of other get_next_ino() filesystems, and carry a patch
>>>>> to mm/shmem.c for it to use 64-bit inode numbers (and separate inode
>>>>> number space for each superblock) - essentially,
>>>>>
>>>>> 	ino = sbinfo->next_ino++;
>>>>> 	/* Avoid 0 in the low 32 bits: might appear deleted */
>>>>> 	if (unlikely((unsigned int)ino == 0))
>>>>> 		ino = sbinfo->next_ino++;
>>>>>
>>>>> Which I think would be faster, and need less memory, than IDA.
>>>>> But whether that is of general interest, or of interest to you,
>>>>> depends upon how prevalent 32-bit executables built without
>>>>> __FILE_OFFSET_BITS=64 still are these days.
>>>> So how google think about this? inode number > 32-bit, but 32-bit executables
>>>> cat not handle this?
>>> Google is free to limit what executables are run on its machines,
>>> and how they are built, so little problem here.
>>>
>>> A general-purpose 32-bit Linux distribution does not have that freedom,
>>> does not want to limit what the user runs.  But I thought that by now
>>> they (and all serious users of 32-bit systems) were building their own
>>> executables with _FILE_OFFSET_BITS=64 (I was too generous with the
>>> underscores yesterday); and I thought that defined __USE_FILE_OFFSET64,
>>> and that typedef'd ino_t to be __ino64_t.  And the 32-bit kernel would
>>> have __ARCH_WANT_STAT64, which delivers st_ino as unsigned long long.
>>>
>>> So I thought that a modern, professional 32-bit executable would be
>>> dealing in 64-bit inode numbers anyway.  But I am not a system builder,
>>> so perhaps I'm being naive.  And of course some users may have to support
>>> some old userspace, or apps that assign inode numbers to "int" or "long"
>>> or whatever.  I have no insight into the extent of that problem.
>> So how to solve this problem?
>>
>> 1. tmpfs use ida or other data structure
>>
>> 2. tmpfs use 64-bit, each superblock a inode number space
>>
>> 3. do not do anything, If somebody hits this bug, let them solve for themselves
>>
>> 4. (last_ino change to 64-bit)get_next_ino -->other filesystems will be ok, but it was rejected before
> 5. Extend the sbitmap API to allow for growing the bitmap.  I had a
> look at doing that, and it looks hard.  There are a lot of things which
> are set up at initialisation and changing them mid-use seems tricky.
> Ccing Jens in case he has an opinion.
>
> 6. Creating a percpu IDA.  This doesn't seem too hard.  We need a percpu
> pointer to an IDA leaf (128 bytes), and a percpu integer which is the
> current base for this CPU.  At allocation time, find and set the first
> free bit in the leaf, and add on the current base.
>
> If the percpu leaf is full, set the XA_MARK_1 bit on the entry in
> the XArray.  Then look for any leaves which have both the XA_MARK_0
> and XA_MARK_1 bits set; if there is one, claim it by clearing the
> XA_MARK_1 bit.  If not, kzalloc a new one and find a free spot for it
> in the underlying XArray.
>
> Freeing an ID is simply ida_free().  That will involve changing the
> users of get_next_ino() to call put_ino(), or something.
>
> This should generally result in similar contention between threads as
> the current scheme -- accessing a shared resource every 1024 allocations.
> Maybe more often as we try to avoid leaving gaps in the data structure,
> or maybe less as we reuse IDs.
>
> (I've tried to explain what I want here, but appreciate it may be
> inscrutable.  I can try to explain more, or maybe I should just write
> the code myself)

I am trying to understand it, if you write the code, I am also very welcome.

By the way, percpu IDA is for reducing performance impact? This patch has 2.16%

performance degradation(Use perf to get the cost of ida_alloc_range)

>
> .
>
Matthew Wilcox (Oracle) Nov. 23, 2019, 2:33 a.m. UTC | #12
On Sat, Nov 23, 2019 at 10:16:39AM +0800, zhengbin (A) wrote:
> By the way, percpu IDA is for reducing performance impact? This patch has 2.16%
> performance degradation(Use perf to get the cost of ida_alloc_range)

2.16% degradation in your tests ... I bet Eric didn't make it so complex
for only 2% performance impact.  Unfortunately, he didn't give any
numbers in his patch submission, but it's going to be a bigger problem
on multi-socket machines than on a laptop.

Eric, any memories from 2010?  Commit
f991bd2e14210fb93d722cb23e54991de20e8a3d if it helps.
Al Viro Nov. 23, 2019, 4:54 a.m. UTC | #13
On Fri, Nov 22, 2019 at 06:33:25PM -0800, Matthew Wilcox wrote:
> On Sat, Nov 23, 2019 at 10:16:39AM +0800, zhengbin (A) wrote:
> > By the way, percpu IDA is for reducing performance impact? This patch has 2.16%
> > performance degradation(Use perf to get the cost of ida_alloc_range)
> 
> 2.16% degradation in your tests ... I bet Eric didn't make it so complex
> for only 2% performance impact.  Unfortunately, he didn't give any
> numbers in his patch submission, but it's going to be a bigger problem
> on multi-socket machines than on a laptop.

It could also be that socket(2) is called (on real loads) just a bit
more often than open(2) in a benchmark that tests file creation...
Zheng Bin Dec. 1, 2019, 8:44 a.m. UTC | #14
On 2019/11/23 6:13, Matthew Wilcox wrote:
> On Fri, Nov 22, 2019 at 09:23:30AM +0800, zhengbin (A) wrote:
>> On 2019/11/22 3:53, Hugh Dickins wrote:
>>> On Thu, 21 Nov 2019, zhengbin (A) wrote:
>>>> On 2019/11/21 12:52, Hugh Dickins wrote:
>>>>> Just a rushed FYI without looking at your patch or comments.
>>>>>
>>>>> Internally (in Google) we do rely on good tmpfs inode numbers more
>>>>> than on those of other get_next_ino() filesystems, and carry a patch
>>>>> to mm/shmem.c for it to use 64-bit inode numbers (and separate inode
>>>>> number space for each superblock) - essentially,
>>>>>
>>>>> 	ino = sbinfo->next_ino++;
>>>>> 	/* Avoid 0 in the low 32 bits: might appear deleted */
>>>>> 	if (unlikely((unsigned int)ino == 0))
>>>>> 		ino = sbinfo->next_ino++;
>>>>>
>>>>> Which I think would be faster, and need less memory, than IDA.
>>>>> But whether that is of general interest, or of interest to you,
>>>>> depends upon how prevalent 32-bit executables built without
>>>>> __FILE_OFFSET_BITS=64 still are these days.
>>>> So how google think about this? inode number > 32-bit, but 32-bit executables
>>>> cat not handle this?
>>> Google is free to limit what executables are run on its machines,
>>> and how they are built, so little problem here.
>>>
>>> A general-purpose 32-bit Linux distribution does not have that freedom,
>>> does not want to limit what the user runs.  But I thought that by now
>>> they (and all serious users of 32-bit systems) were building their own
>>> executables with _FILE_OFFSET_BITS=64 (I was too generous with the
>>> underscores yesterday); and I thought that defined __USE_FILE_OFFSET64,
>>> and that typedef'd ino_t to be __ino64_t.  And the 32-bit kernel would
>>> have __ARCH_WANT_STAT64, which delivers st_ino as unsigned long long.
>>>
>>> So I thought that a modern, professional 32-bit executable would be
>>> dealing in 64-bit inode numbers anyway.  But I am not a system builder,
>>> so perhaps I'm being naive.  And of course some users may have to support
>>> some old userspace, or apps that assign inode numbers to "int" or "long"
>>> or whatever.  I have no insight into the extent of that problem.
>> So how to solve this problem?
>>
>> 1. tmpfs use ida or other data structure
>>
>> 2. tmpfs use 64-bit, each superblock a inode number space
>>
>> 3. do not do anything, If somebody hits this bug, let them solve for themselves
>>
>> 4. (last_ino change to 64-bit)get_next_ino -->other filesystems will be ok, but it was rejected before
> 5. Extend the sbitmap API to allow for growing the bitmap.  I had a
> look at doing that, and it looks hard.  There are a lot of things which
> are set up at initialisation and changing them mid-use seems tricky.
> Ccing Jens in case he has an opinion.
>
> 6. Creating a percpu IDA.  This doesn't seem too hard.  We need a percpu
> pointer to an IDA leaf (128 bytes), and a percpu integer which is the
> current base for this CPU.  At allocation time, find and set the first
> free bit in the leaf, and add on the current base.
>
> If the percpu leaf is full, set the XA_MARK_1 bit on the entry in
> the XArray.  Then look for any leaves which have both the XA_MARK_0
> and XA_MARK_1 bits set; if there is one, claim it by clearing the
> XA_MARK_1 bit.  If not, kzalloc a new one and find a free spot for it
> in the underlying XArray.
>
> Freeing an ID is simply ida_free().  That will involve changing the
> users of get_next_ino() to call put_ino(), or something.
>
> This should generally result in similar contention between threads as
> the current scheme -- accessing a shared resource every 1024 allocations.
> Maybe more often as we try to avoid leaving gaps in the data structure,
> or maybe less as we reuse IDs.
>
> (I've tried to explain what I want here, but appreciate it may be
> inscrutable.  I can try to explain more, or maybe I should just write
> the code myself)
Hi willy, do you write this?
>
> .
>
diff mbox series

Patch

diff --git a/mm/shmem.c b/mm/shmem.c
index 5b93877..6b5e01b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -258,6 +258,7 @@  static const struct inode_operations shmem_dir_inode_operations;
 static const struct inode_operations shmem_special_inode_operations;
 static const struct vm_operations_struct shmem_vm_ops;
 static struct file_system_type shmem_fs_type;
+static DEFINE_IDA(shmem_inode_ida);

 bool vma_is_shmem(struct vm_area_struct *vma)
 {
@@ -1138,6 +1139,7 @@  static void shmem_evict_inode(struct inode *inode)

 	simple_xattrs_free(&info->xattrs);
 	WARN_ON(inode->i_blocks);
+	ida_simple_remove(&shmem_inode_ida, inode->i_ino);
 	shmem_free_inode(inode->i_sb);
 	clear_inode(inode);
 }
@@ -2213,13 +2215,20 @@  static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 	struct inode *inode;
 	struct shmem_inode_info *info;
 	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+	int i_ino;

 	if (shmem_reserve_inode(sb))
 		return NULL;

+	i_ino = ida_simple_get(&shmem_inode_ida, 0, 0, GFP_KERNEL);
+	if (i_ino < 0) {
+		shmem_free_inode(sb);
+		return NULL;
+	}
+
 	inode = new_inode(sb);
 	if (inode) {
-		inode->i_ino = get_next_ino();
+		inode->i_ino = i_ino;
 		inode_init_owner(inode, dir, mode);
 		inode->i_blocks = 0;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
@@ -2263,8 +2272,11 @@  static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 		}

 		lockdep_annotate_inode_mutex_key(inode);
-	} else
+	} else {
+		ida_simple_remove(&shmem_inode_ida, i_ino);
 		shmem_free_inode(sb);
+	}
+
 	return inode;
 }