Message ID | 1574259798-144561-1-git-send-email-zhengbin13@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | tmpfs: use ida to get inode number | expand |
On Wed, Nov 20, 2019 at 10:23:18PM +0800, zhengbin wrote: > I have tried to change last_ino type to unsigned long, while this was > rejected, see details on https://patchwork.kernel.org/patch/11023915. Did you end up trying sbitmap? What I think is fundamentally wrong with this patch is that you've found a problem in get_next_ino() and decided to use a different scheme for this one filesystem, leaving every other filesystem which uses get_next_ino() facing the same problem. That could be acceptable if you explained why tmpfs is fundamentally different from all the other filesystems that use get_next_ino(), but you haven't (and I don't think there is such a difference. eg pipes, autofs and ipc mqueue could all have the same problem. There are some other problems I noticed, but they're not worth bringing up until this fundamental design choice is justified.
On 2019/11/20 23:45, Matthew Wilcox wrote: > On Wed, Nov 20, 2019 at 10:23:18PM +0800, zhengbin wrote: >> I have tried to change last_ino type to unsigned long, while this was >> rejected, see details on https://patchwork.kernel.org/patch/11023915. > Did you end up trying sbitmap? Maybe sbitmap is not a good solution, max_inodes of tmpfs are controlled by mount options--nrinodes, which can be modified by remountfs(bigger or smaller), as the comment of function sbitmap_resize says: * Doesn't reallocate anything. It's up to the caller to ensure that the new * depth doesn't exceed the depth that the sb was initialized with. We can modify this to meet the growing requirements, there will still be questions as follows: 1. tmpfs is a ram filesystem, we need to allocate sbitmap memory for sbinfo->max_inodes(while this maybe huge) 2.If remountfs changes max_inode, we have to deal with it, while this may take a long time (bigger: we need to free the old sbitmap memory, allocate new memory, copy the old sbitmap to new sbitmap smaller: How do we deal with it?ie: we use sb->map[inode number/8] to find the sbitmap, we need to change the exist inode numbers?while this maybe used by userspace application.) > > What I think is fundamentally wrong with this patch is that you've found a > problem in get_next_ino() and decided to use a different scheme for this > one filesystem, leaving every other filesystem which uses get_next_ino() > facing the same problem. > > That could be acceptable if you explained why tmpfs is fundamentally > different from all the other filesystems that use get_next_ino(), but > you haven't (and I don't think there is such a difference. eg pipes, > autofs and ipc mqueue could all have the same problem. tmpfs is same with all the other filesystems that use get_next_ino(), but we need to solve this problem one by one. If tmpfs is ok, we can modify the other filesystems too. Besides, I do not recommend all file systems share the same global variable, for performance impact consideration. > > There are some other problems I noticed, but they're not worth bringing > up until this fundamental design choice is justified. Agree, thanks. >
On Wed, Nov 20, 2019 at 07:45:52AM -0800, Matthew Wilcox wrote: > On Wed, Nov 20, 2019 at 10:23:18PM +0800, zhengbin wrote: > > I have tried to change last_ino type to unsigned long, while this was > > rejected, see details on https://patchwork.kernel.org/patch/11023915. > > Did you end up trying sbitmap? > > What I think is fundamentally wrong with this patch is that you've found a > problem in get_next_ino() and decided to use a different scheme for this > one filesystem, leaving every other filesystem which uses get_next_ino() > facing the same problem. > > That could be acceptable if you explained why tmpfs is fundamentally > different from all the other filesystems that use get_next_ino(), but > you haven't (and I don't think there is such a difference. eg pipes, > autofs and ipc mqueue could all have the same problem. If you think that anyone is willing to pay one hell of a price on each pipe(2)... Note that get_next_ino() is pretty careful about staying within per-cpu stuff most of the time; it hits any cross-CPU traffic only in 1/1024th of calls. This, AFAICS, dirties shared cachelines on each call. And there's a plenty of pipe-heavy workloads, for obvious reasons.
On Thu, 21 Nov 2019, zhengbin (A) wrote: > On 2019/11/20 23:45, Matthew Wilcox wrote: > > On Wed, Nov 20, 2019 at 10:23:18PM +0800, zhengbin wrote: > >> I have tried to change last_ino type to unsigned long, while this was > >> rejected, see details on https://patchwork.kernel.org/patch/11023915. > > Did you end up trying sbitmap? > > Maybe sbitmap is not a good solution, max_inodes of tmpfs are controlled by mount options--nrinodes, > > which can be modified by remountfs(bigger or smaller), as the comment of function sbitmap_resize says: > > * Doesn't reallocate anything. It's up to the caller to ensure that the new > * depth doesn't exceed the depth that the sb was initialized with. > > We can modify this to meet the growing requirements, there will still be questions as follows: > > 1. tmpfs is a ram filesystem, we need to allocate sbitmap memory for sbinfo->max_inodes(while this maybe huge) > > 2.If remountfs changes max_inode, we have to deal with it, while this may take a long time > > (bigger: we need to free the old sbitmap memory, allocate new memory, copy the old sbitmap to new sbitmap > > smaller: How do we deal with it?ie: we use sb->map[inode number/8] to find the sbitmap, we need to change the exist > > inode numbers?while this maybe used by userspace application.) > > > > > What I think is fundamentally wrong with this patch is that you've found a > > problem in get_next_ino() and decided to use a different scheme for this > > one filesystem, leaving every other filesystem which uses get_next_ino() > > facing the same problem. > > > > That could be acceptable if you explained why tmpfs is fundamentally > > different from all the other filesystems that use get_next_ino(), but > > you haven't (and I don't think there is such a difference. eg pipes, > > autofs and ipc mqueue could all have the same problem. > > tmpfs is same with all the other filesystems that use get_next_ino(), but we need to solve this problem one by one. > > If tmpfs is ok, we can modify the other filesystems too. Besides, I do not recommend all file systems share the same > > global variable, for performance impact consideration. > > > > > There are some other problems I noticed, but they're not worth bringing > > up until this fundamental design choice is justified. > Agree, thanks. Just a rushed FYI without looking at your patch or comments. Internally (in Google) we do rely on good tmpfs inode numbers more than on those of other get_next_ino() filesystems, and carry a patch to mm/shmem.c for it to use 64-bit inode numbers (and separate inode number space for each superblock) - essentially, ino = sbinfo->next_ino++; /* Avoid 0 in the low 32 bits: might appear deleted */ if (unlikely((unsigned int)ino == 0)) ino = sbinfo->next_ino++; Which I think would be faster, and need less memory, than IDA. But whether that is of general interest, or of interest to you, depends upon how prevalent 32-bit executables built without __FILE_OFFSET_BITS=64 still are these days. Hugh
On 2019/11/21 12:52, Hugh Dickins wrote: > On Thu, 21 Nov 2019, zhengbin (A) wrote: >> On 2019/11/20 23:45, Matthew Wilcox wrote: >>> On Wed, Nov 20, 2019 at 10:23:18PM +0800, zhengbin wrote: >>>> I have tried to change last_ino type to unsigned long, while this was >>>> rejected, see details on https://patchwork.kernel.org/patch/11023915. >>> Did you end up trying sbitmap? >> Maybe sbitmap is not a good solution, max_inodes of tmpfs are controlled by mount options--nrinodes, >> >> which can be modified by remountfs(bigger or smaller), as the comment of function sbitmap_resize says: >> >> * Doesn't reallocate anything. It's up to the caller to ensure that the new >> * depth doesn't exceed the depth that the sb was initialized with. >> >> We can modify this to meet the growing requirements, there will still be questions as follows: >> >> 1. tmpfs is a ram filesystem, we need to allocate sbitmap memory for sbinfo->max_inodes(while this maybe huge) >> >> 2.If remountfs changes max_inode, we have to deal with it, while this may take a long time >> >> (bigger: we need to free the old sbitmap memory, allocate new memory, copy the old sbitmap to new sbitmap >> >> smaller: How do we deal with it?ie: we use sb->map[inode number/8] to find the sbitmap, we need to change the exist >> >> inode numbers?while this maybe used by userspace application.) >> >>> What I think is fundamentally wrong with this patch is that you've found a >>> problem in get_next_ino() and decided to use a different scheme for this >>> one filesystem, leaving every other filesystem which uses get_next_ino() >>> facing the same problem. >>> >>> That could be acceptable if you explained why tmpfs is fundamentally >>> different from all the other filesystems that use get_next_ino(), but >>> you haven't (and I don't think there is such a difference. eg pipes, >>> autofs and ipc mqueue could all have the same problem. >> tmpfs is same with all the other filesystems that use get_next_ino(), but we need to solve this problem one by one. >> >> If tmpfs is ok, we can modify the other filesystems too. Besides, I do not recommend all file systems share the same >> >> global variable, for performance impact consideration. >> >>> There are some other problems I noticed, but they're not worth bringing >>> up until this fundamental design choice is justified. >> Agree, thanks. > Just a rushed FYI without looking at your patch or comments. > > Internally (in Google) we do rely on good tmpfs inode numbers more > than on those of other get_next_ino() filesystems, and carry a patch > to mm/shmem.c for it to use 64-bit inode numbers (and separate inode > number space for each superblock) - essentially, > > ino = sbinfo->next_ino++; > /* Avoid 0 in the low 32 bits: might appear deleted */ > if (unlikely((unsigned int)ino == 0)) > ino = sbinfo->next_ino++; > > Which I think would be faster, and need less memory, than IDA. > But whether that is of general interest, or of interest to you, > depends upon how prevalent 32-bit executables built without > __FILE_OFFSET_BITS=64 still are these days. So how google think about this? inode number > 32-bit, but 32-bit executables cat not handle this? "separate inode number space for each superblock" can reduce the probability, but still can not solve it. > > Hugh
Hugh Dickins: > Internally (in Google) we do rely on good tmpfs inode numbers more > than on those of other get_next_ino() filesystems, and carry a patch > to mm/shmem.c for it to use 64-bit inode numbers (and separate inode > number space for each superblock) - essentially, > > =09ino =3D sbinfo->next_ino++; > =09/* Avoid 0 in the low 32 bits: might appear deleted */ > =09if (unlikely((unsigned int)ino =3D=3D 0)) > =09=09ino =3D sbinfo->next_ino++; I agree with that "per superblock inum space", but I don't see your point. How can you manage it fully? I mean how can you decide whether the new inum is in use or not? For example, - you create a file which is assigned inum#10. - you or other people create and unlink over and over on the same tmpfs. - then sbinfo->next_ino will become zero, skipped, ok. - and then it will be 10. I don't think you want to share the same inum by two inodes. Moreover, SysV SHM uses tmpfs and shmget(2) overwrite inum internally. It will be another seed of a similar problem. J. R. Okajima
On Thu, 21 Nov 2019, zhengbin (A) wrote: > On 2019/11/21 12:52, Hugh Dickins wrote: > > Just a rushed FYI without looking at your patch or comments. > > > > Internally (in Google) we do rely on good tmpfs inode numbers more > > than on those of other get_next_ino() filesystems, and carry a patch > > to mm/shmem.c for it to use 64-bit inode numbers (and separate inode > > number space for each superblock) - essentially, > > > > ino = sbinfo->next_ino++; > > /* Avoid 0 in the low 32 bits: might appear deleted */ > > if (unlikely((unsigned int)ino == 0)) > > ino = sbinfo->next_ino++; > > > > Which I think would be faster, and need less memory, than IDA. > > But whether that is of general interest, or of interest to you, > > depends upon how prevalent 32-bit executables built without > > __FILE_OFFSET_BITS=64 still are these days. > > So how google think about this? inode number > 32-bit, but 32-bit executables > cat not handle this? Google is free to limit what executables are run on its machines, and how they are built, so little problem here. A general-purpose 32-bit Linux distribution does not have that freedom, does not want to limit what the user runs. But I thought that by now they (and all serious users of 32-bit systems) were building their own executables with _FILE_OFFSET_BITS=64 (I was too generous with the underscores yesterday); and I thought that defined __USE_FILE_OFFSET64, and that typedef'd ino_t to be __ino64_t. And the 32-bit kernel would have __ARCH_WANT_STAT64, which delivers st_ino as unsigned long long. So I thought that a modern, professional 32-bit executable would be dealing in 64-bit inode numbers anyway. But I am not a system builder, so perhaps I'm being naive. And of course some users may have to support some old userspace, or apps that assign inode numbers to "int" or "long" or whatever. I have no insight into the extent of that problem. Hugh
On Thu, 21 Nov 2019, J. R. Okajima wrote: > Hugh Dickins: > > Internally (in Google) we do rely on good tmpfs inode numbers more > > than on those of other get_next_ino() filesystems, and carry a patch > > to mm/shmem.c for it to use 64-bit inode numbers (and separate inode > > number space for each superblock) - essentially, > > > > =09ino =3D sbinfo->next_ino++; > > =09/* Avoid 0 in the low 32 bits: might appear deleted */ > > =09if (unlikely((unsigned int)ino =3D=3D 0)) > > =09=09ino =3D sbinfo->next_ino++; > > I agree with that "per superblock inum space", but I don't see your > point. How can you manage it fully? I mean how can you decide whether > the new inum is in use or not? > For example, > - you create a file which is assigned inum#10. > - you or other people create and unlink over and over on the same tmpfs. > - then sbinfo->next_ino will become zero, skipped, ok. > - and then it will be 10. > I don't think you want to share the same inum by two inodes. 64 bits. I haven't done the arithmetic to work out the amusing number, but zhengbin mentioned the script taking 10 days to duplicate an inode number in 32 bits, so: a larger number of years than I need to care about. > > Moreover, SysV SHM uses tmpfs and shmget(2) overwrite inum internally. > It will be another seed of a similar problem. I was totally ignorant of that peculiarity in ipc/shm.c, thanks for alerting me to it. But it doesn't affect what we're doing in tmpfs, and apparently suits the users of SysV SHM: I don't see any need to worry about it. Hugh
On 2019/11/22 3:53, Hugh Dickins wrote: > On Thu, 21 Nov 2019, zhengbin (A) wrote: >> On 2019/11/21 12:52, Hugh Dickins wrote: >>> Just a rushed FYI without looking at your patch or comments. >>> >>> Internally (in Google) we do rely on good tmpfs inode numbers more >>> than on those of other get_next_ino() filesystems, and carry a patch >>> to mm/shmem.c for it to use 64-bit inode numbers (and separate inode >>> number space for each superblock) - essentially, >>> >>> ino = sbinfo->next_ino++; >>> /* Avoid 0 in the low 32 bits: might appear deleted */ >>> if (unlikely((unsigned int)ino == 0)) >>> ino = sbinfo->next_ino++; >>> >>> Which I think would be faster, and need less memory, than IDA. >>> But whether that is of general interest, or of interest to you, >>> depends upon how prevalent 32-bit executables built without >>> __FILE_OFFSET_BITS=64 still are these days. >> So how google think about this? inode number > 32-bit, but 32-bit executables >> cat not handle this? > Google is free to limit what executables are run on its machines, > and how they are built, so little problem here. > > A general-purpose 32-bit Linux distribution does not have that freedom, > does not want to limit what the user runs. But I thought that by now > they (and all serious users of 32-bit systems) were building their own > executables with _FILE_OFFSET_BITS=64 (I was too generous with the > underscores yesterday); and I thought that defined __USE_FILE_OFFSET64, > and that typedef'd ino_t to be __ino64_t. And the 32-bit kernel would > have __ARCH_WANT_STAT64, which delivers st_ino as unsigned long long. > > So I thought that a modern, professional 32-bit executable would be > dealing in 64-bit inode numbers anyway. But I am not a system builder, > so perhaps I'm being naive. And of course some users may have to support > some old userspace, or apps that assign inode numbers to "int" or "long" > or whatever. I have no insight into the extent of that problem. So how to solve this problem? 1. tmpfs use ida or other data structure 2. tmpfs use 64-bit, each superblock a inode number space 3. do not do anything, If somebody hits this bug, let them solve for themselves 4. (last_ino change to 64-bit)get_next_ino -->other filesystems will be ok, but it was rejected before > > Hugh > > . >
On Fri, Nov 22, 2019 at 09:23:30AM +0800, zhengbin (A) wrote: > On 2019/11/22 3:53, Hugh Dickins wrote: > > On Thu, 21 Nov 2019, zhengbin (A) wrote: > >> On 2019/11/21 12:52, Hugh Dickins wrote: > >>> Just a rushed FYI without looking at your patch or comments. > >>> > >>> Internally (in Google) we do rely on good tmpfs inode numbers more > >>> than on those of other get_next_ino() filesystems, and carry a patch > >>> to mm/shmem.c for it to use 64-bit inode numbers (and separate inode > >>> number space for each superblock) - essentially, > >>> > >>> ino = sbinfo->next_ino++; > >>> /* Avoid 0 in the low 32 bits: might appear deleted */ > >>> if (unlikely((unsigned int)ino == 0)) > >>> ino = sbinfo->next_ino++; > >>> > >>> Which I think would be faster, and need less memory, than IDA. > >>> But whether that is of general interest, or of interest to you, > >>> depends upon how prevalent 32-bit executables built without > >>> __FILE_OFFSET_BITS=64 still are these days. > >> So how google think about this? inode number > 32-bit, but 32-bit executables > >> cat not handle this? > > Google is free to limit what executables are run on its machines, > > and how they are built, so little problem here. > > > > A general-purpose 32-bit Linux distribution does not have that freedom, > > does not want to limit what the user runs. But I thought that by now > > they (and all serious users of 32-bit systems) were building their own > > executables with _FILE_OFFSET_BITS=64 (I was too generous with the > > underscores yesterday); and I thought that defined __USE_FILE_OFFSET64, > > and that typedef'd ino_t to be __ino64_t. And the 32-bit kernel would > > have __ARCH_WANT_STAT64, which delivers st_ino as unsigned long long. > > > > So I thought that a modern, professional 32-bit executable would be > > dealing in 64-bit inode numbers anyway. But I am not a system builder, > > so perhaps I'm being naive. And of course some users may have to support > > some old userspace, or apps that assign inode numbers to "int" or "long" > > or whatever. I have no insight into the extent of that problem. > > So how to solve this problem? > > 1. tmpfs use ida or other data structure > > 2. tmpfs use 64-bit, each superblock a inode number space > > 3. do not do anything, If somebody hits this bug, let them solve for themselves > > 4. (last_ino change to 64-bit)get_next_ino -->other filesystems will be ok, but it was rejected before 5. Extend the sbitmap API to allow for growing the bitmap. I had a look at doing that, and it looks hard. There are a lot of things which are set up at initialisation and changing them mid-use seems tricky. Ccing Jens in case he has an opinion. 6. Creating a percpu IDA. This doesn't seem too hard. We need a percpu pointer to an IDA leaf (128 bytes), and a percpu integer which is the current base for this CPU. At allocation time, find and set the first free bit in the leaf, and add on the current base. If the percpu leaf is full, set the XA_MARK_1 bit on the entry in the XArray. Then look for any leaves which have both the XA_MARK_0 and XA_MARK_1 bits set; if there is one, claim it by clearing the XA_MARK_1 bit. If not, kzalloc a new one and find a free spot for it in the underlying XArray. Freeing an ID is simply ida_free(). That will involve changing the users of get_next_ino() to call put_ino(), or something. This should generally result in similar contention between threads as the current scheme -- accessing a shared resource every 1024 allocations. Maybe more often as we try to avoid leaving gaps in the data structure, or maybe less as we reuse IDs. (I've tried to explain what I want here, but appreciate it may be inscrutable. I can try to explain more, or maybe I should just write the code myself)
On 2019/11/23 6:13, Matthew Wilcox wrote: > On Fri, Nov 22, 2019 at 09:23:30AM +0800, zhengbin (A) wrote: >> On 2019/11/22 3:53, Hugh Dickins wrote: >>> On Thu, 21 Nov 2019, zhengbin (A) wrote: >>>> On 2019/11/21 12:52, Hugh Dickins wrote: >>>>> Just a rushed FYI without looking at your patch or comments. >>>>> >>>>> Internally (in Google) we do rely on good tmpfs inode numbers more >>>>> than on those of other get_next_ino() filesystems, and carry a patch >>>>> to mm/shmem.c for it to use 64-bit inode numbers (and separate inode >>>>> number space for each superblock) - essentially, >>>>> >>>>> ino = sbinfo->next_ino++; >>>>> /* Avoid 0 in the low 32 bits: might appear deleted */ >>>>> if (unlikely((unsigned int)ino == 0)) >>>>> ino = sbinfo->next_ino++; >>>>> >>>>> Which I think would be faster, and need less memory, than IDA. >>>>> But whether that is of general interest, or of interest to you, >>>>> depends upon how prevalent 32-bit executables built without >>>>> __FILE_OFFSET_BITS=64 still are these days. >>>> So how google think about this? inode number > 32-bit, but 32-bit executables >>>> cat not handle this? >>> Google is free to limit what executables are run on its machines, >>> and how they are built, so little problem here. >>> >>> A general-purpose 32-bit Linux distribution does not have that freedom, >>> does not want to limit what the user runs. But I thought that by now >>> they (and all serious users of 32-bit systems) were building their own >>> executables with _FILE_OFFSET_BITS=64 (I was too generous with the >>> underscores yesterday); and I thought that defined __USE_FILE_OFFSET64, >>> and that typedef'd ino_t to be __ino64_t. And the 32-bit kernel would >>> have __ARCH_WANT_STAT64, which delivers st_ino as unsigned long long. >>> >>> So I thought that a modern, professional 32-bit executable would be >>> dealing in 64-bit inode numbers anyway. But I am not a system builder, >>> so perhaps I'm being naive. And of course some users may have to support >>> some old userspace, or apps that assign inode numbers to "int" or "long" >>> or whatever. I have no insight into the extent of that problem. >> So how to solve this problem? >> >> 1. tmpfs use ida or other data structure >> >> 2. tmpfs use 64-bit, each superblock a inode number space >> >> 3. do not do anything, If somebody hits this bug, let them solve for themselves >> >> 4. (last_ino change to 64-bit)get_next_ino -->other filesystems will be ok, but it was rejected before > 5. Extend the sbitmap API to allow for growing the bitmap. I had a > look at doing that, and it looks hard. There are a lot of things which > are set up at initialisation and changing them mid-use seems tricky. > Ccing Jens in case he has an opinion. > > 6. Creating a percpu IDA. This doesn't seem too hard. We need a percpu > pointer to an IDA leaf (128 bytes), and a percpu integer which is the > current base for this CPU. At allocation time, find and set the first > free bit in the leaf, and add on the current base. > > If the percpu leaf is full, set the XA_MARK_1 bit on the entry in > the XArray. Then look for any leaves which have both the XA_MARK_0 > and XA_MARK_1 bits set; if there is one, claim it by clearing the > XA_MARK_1 bit. If not, kzalloc a new one and find a free spot for it > in the underlying XArray. > > Freeing an ID is simply ida_free(). That will involve changing the > users of get_next_ino() to call put_ino(), or something. > > This should generally result in similar contention between threads as > the current scheme -- accessing a shared resource every 1024 allocations. > Maybe more often as we try to avoid leaving gaps in the data structure, > or maybe less as we reuse IDs. > > (I've tried to explain what I want here, but appreciate it may be > inscrutable. I can try to explain more, or maybe I should just write > the code myself) I am trying to understand it, if you write the code, I am also very welcome. By the way, percpu IDA is for reducing performance impact? This patch has 2.16% performance degradation(Use perf to get the cost of ida_alloc_range) > > . >
On Sat, Nov 23, 2019 at 10:16:39AM +0800, zhengbin (A) wrote: > By the way, percpu IDA is for reducing performance impact? This patch has 2.16% > performance degradation(Use perf to get the cost of ida_alloc_range) 2.16% degradation in your tests ... I bet Eric didn't make it so complex for only 2% performance impact. Unfortunately, he didn't give any numbers in his patch submission, but it's going to be a bigger problem on multi-socket machines than on a laptop. Eric, any memories from 2010? Commit f991bd2e14210fb93d722cb23e54991de20e8a3d if it helps.
On Fri, Nov 22, 2019 at 06:33:25PM -0800, Matthew Wilcox wrote: > On Sat, Nov 23, 2019 at 10:16:39AM +0800, zhengbin (A) wrote: > > By the way, percpu IDA is for reducing performance impact? This patch has 2.16% > > performance degradation(Use perf to get the cost of ida_alloc_range) > > 2.16% degradation in your tests ... I bet Eric didn't make it so complex > for only 2% performance impact. Unfortunately, he didn't give any > numbers in his patch submission, but it's going to be a bigger problem > on multi-socket machines than on a laptop. It could also be that socket(2) is called (on real loads) just a bit more often than open(2) in a benchmark that tests file creation...
On 2019/11/23 6:13, Matthew Wilcox wrote: > On Fri, Nov 22, 2019 at 09:23:30AM +0800, zhengbin (A) wrote: >> On 2019/11/22 3:53, Hugh Dickins wrote: >>> On Thu, 21 Nov 2019, zhengbin (A) wrote: >>>> On 2019/11/21 12:52, Hugh Dickins wrote: >>>>> Just a rushed FYI without looking at your patch or comments. >>>>> >>>>> Internally (in Google) we do rely on good tmpfs inode numbers more >>>>> than on those of other get_next_ino() filesystems, and carry a patch >>>>> to mm/shmem.c for it to use 64-bit inode numbers (and separate inode >>>>> number space for each superblock) - essentially, >>>>> >>>>> ino = sbinfo->next_ino++; >>>>> /* Avoid 0 in the low 32 bits: might appear deleted */ >>>>> if (unlikely((unsigned int)ino == 0)) >>>>> ino = sbinfo->next_ino++; >>>>> >>>>> Which I think would be faster, and need less memory, than IDA. >>>>> But whether that is of general interest, or of interest to you, >>>>> depends upon how prevalent 32-bit executables built without >>>>> __FILE_OFFSET_BITS=64 still are these days. >>>> So how google think about this? inode number > 32-bit, but 32-bit executables >>>> cat not handle this? >>> Google is free to limit what executables are run on its machines, >>> and how they are built, so little problem here. >>> >>> A general-purpose 32-bit Linux distribution does not have that freedom, >>> does not want to limit what the user runs. But I thought that by now >>> they (and all serious users of 32-bit systems) were building their own >>> executables with _FILE_OFFSET_BITS=64 (I was too generous with the >>> underscores yesterday); and I thought that defined __USE_FILE_OFFSET64, >>> and that typedef'd ino_t to be __ino64_t. And the 32-bit kernel would >>> have __ARCH_WANT_STAT64, which delivers st_ino as unsigned long long. >>> >>> So I thought that a modern, professional 32-bit executable would be >>> dealing in 64-bit inode numbers anyway. But I am not a system builder, >>> so perhaps I'm being naive. And of course some users may have to support >>> some old userspace, or apps that assign inode numbers to "int" or "long" >>> or whatever. I have no insight into the extent of that problem. >> So how to solve this problem? >> >> 1. tmpfs use ida or other data structure >> >> 2. tmpfs use 64-bit, each superblock a inode number space >> >> 3. do not do anything, If somebody hits this bug, let them solve for themselves >> >> 4. (last_ino change to 64-bit)get_next_ino -->other filesystems will be ok, but it was rejected before > 5. Extend the sbitmap API to allow for growing the bitmap. I had a > look at doing that, and it looks hard. There are a lot of things which > are set up at initialisation and changing them mid-use seems tricky. > Ccing Jens in case he has an opinion. > > 6. Creating a percpu IDA. This doesn't seem too hard. We need a percpu > pointer to an IDA leaf (128 bytes), and a percpu integer which is the > current base for this CPU. At allocation time, find and set the first > free bit in the leaf, and add on the current base. > > If the percpu leaf is full, set the XA_MARK_1 bit on the entry in > the XArray. Then look for any leaves which have both the XA_MARK_0 > and XA_MARK_1 bits set; if there is one, claim it by clearing the > XA_MARK_1 bit. If not, kzalloc a new one and find a free spot for it > in the underlying XArray. > > Freeing an ID is simply ida_free(). That will involve changing the > users of get_next_ino() to call put_ino(), or something. > > This should generally result in similar contention between threads as > the current scheme -- accessing a shared resource every 1024 allocations. > Maybe more often as we try to avoid leaving gaps in the data structure, > or maybe less as we reuse IDs. > > (I've tried to explain what I want here, but appreciate it may be > inscrutable. I can try to explain more, or maybe I should just write > the code myself) Hi willy, do you write this? > > . >
diff --git a/mm/shmem.c b/mm/shmem.c index 5b93877..6b5e01b 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -258,6 +258,7 @@ static const struct inode_operations shmem_dir_inode_operations; static const struct inode_operations shmem_special_inode_operations; static const struct vm_operations_struct shmem_vm_ops; static struct file_system_type shmem_fs_type; +static DEFINE_IDA(shmem_inode_ida); bool vma_is_shmem(struct vm_area_struct *vma) { @@ -1138,6 +1139,7 @@ static void shmem_evict_inode(struct inode *inode) simple_xattrs_free(&info->xattrs); WARN_ON(inode->i_blocks); + ida_simple_remove(&shmem_inode_ida, inode->i_ino); shmem_free_inode(inode->i_sb); clear_inode(inode); } @@ -2213,13 +2215,20 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode struct inode *inode; struct shmem_inode_info *info; struct shmem_sb_info *sbinfo = SHMEM_SB(sb); + int i_ino; if (shmem_reserve_inode(sb)) return NULL; + i_ino = ida_simple_get(&shmem_inode_ida, 0, 0, GFP_KERNEL); + if (i_ino < 0) { + shmem_free_inode(sb); + return NULL; + } + inode = new_inode(sb); if (inode) { - inode->i_ino = get_next_ino(); + inode->i_ino = i_ino; inode_init_owner(inode, dir, mode); inode->i_blocks = 0; inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode); @@ -2263,8 +2272,11 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode } lockdep_annotate_inode_mutex_key(inode); - } else + } else { + ida_simple_remove(&shmem_inode_ida, i_ino); shmem_free_inode(sb); + } + return inode; }
Use a script to test tmpfs, after 10 days, there will be files share the same inode number, thus bug happens. The script is as follows: while(1) { create a file dlopen it ... remove it } I have tried to change last_ino type to unsigned long, while this was rejected, see details on https://patchwork.kernel.org/patch/11023915. Use ida to get inode number, from the fs_mark test, performance impact is small. CPU core: 128 memory: 8G Use fs_mark to create ten million files first in tmpfs. Then test performance in the following command(test five times): rm -rf /tmp/fsmark_test sync echo 3 > /proc/sys/vm/drop_caches sleep 10 fs_mark -t 20 -s 0 -n 102400 -D 64 -N 1600 -L 3 -d /tmp/fsmark_test result is file creation speed(Files/sec). get_next_ino | use ida 439506.7 | 423219.0 441453.0 | 406832.7 439868.0 | 441283.3 445642.3 | 428221.3 441776.7 | 438129.0 average: 441649.34 | 427537.06 Signed-off-by: zhengbin <zhengbin13@huawei.com> --- mm/shmem.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) -- 2.7.4