Optimal IO
2023-08-24
dejbug.github.io
I had to check two files for equality in Python and was about to use mmap but started wondering what size its preferred input buffer was.
Someone So pointed out that it used to be hard-coded to 8192 hg:py. Well, even in 3.11 it still does look like it gh. Well, kinda.
#tldr; Everything is explained in _pyio.py gh.
Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device's "block size" and falling back on io.DEFAULT_BUFFER_SIZE. gh
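To make the quoted heuristic concrete, here is a minimal sketch of what it boils down to, as far as I can tell from the 3.11 version of _pyio.open: ask fstat for st_blksize and fall back on io.DEFAULT_BUFFER_SIZE. The name guess_buffer_size is made up for this example, and newer Python versions may pick the size differently.

```python
import io
import os


def guess_buffer_size(path):
    """Hypothetical helper: pick a buffer size roughly the way _pyio.open does."""
    size = io.DEFAULT_BUFFER_SIZE  # the documented fallback
    with open(path, "rb", buffering=0) as raw:  # raw, unbuffered file object
        try:
            blksize = os.fstat(raw.fileno()).st_blksize
        except (OSError, AttributeError):
            # fstat failed, or this platform has no st_blksize (e.g. Windows)
            pass
        else:
            if blksize > 1:
                size = blksize
    return size
```

Calling guess_buffer_size on a regular file should return whatever the file system reports as its block size, or io.DEFAULT_BUFFER_SIZE when it reports nothing useful.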
Alas! I read these lines too late. And because I knew that the block size was available through stat, a pretty little trap was already laid out for me.
$ stat /tmp/tags/GTAGS
  File: /tmp/tags/GTAGS
  Size: 52051968   Blocks: 101664   IO Block: 4096   regular file
Device: #,##   Inode: ###   Links: #
Access: (0644/-rw-r--r--)   Uid: (#####/xxxxxxxx)   Gid: (#####/xxxxxxxx)
Access: 2023-08-23 16:30:49.290685452 +0200
Modify: 2023-08-23 15:51:15.817349681 +0200
Change: 2023-08-23 15:51:15.820683015 +0200
 Birth: 2023-08-23 15:51:11.780683011 +0200
$ stat -c'The "optimal I/O transfer size hint" for %n is %o.' /tmp/tags/GTAGS
The "optimal I/O transfer size hint" for /tmp/tags/GTAGS is 4096.
I could have just called stat through os.system or subprocess.run, but os.stat was just perplexing me!
$ python -c 'import os; print(os.stat("/tmp/tags/GTAGS"))'
os.stat_result(st_mode=33188, st_ino=###, st_dev=##, st_nlink=#, st_uid=#####,
st_gid=#####, st_size=52051968, st_atime=1692801049, st_mtime=1692798675,
st_ctime=1692798675)
$ sed -n '61p' /usr/include/bits/struct_stat.h
__blksize_t st_blksize; /* Optimal block size for I/O. */
So I went digging through sys/stat.h, bits/types.h, bits/typesizes.h and bits/struct_stat.h, trying to recreate struct stat in ctypes gh. Here it is.
import ctypes

# struct stat as pieced together from the headers listed above.
class stat(ctypes.Structure):
    _fields_ = [
        ('st_dev', ctypes.c_ulonglong),
        ('st_ino', ctypes.c_ulonglong),
        ('st_nlink', ctypes.c_ulonglong),
        ('st_mode', ctypes.c_uint),
        ('st_uid', ctypes.c_uint),
        ('st_gid', ctypes.c_uint),
        ('__pad0', ctypes.c_uint),
        ('st_rdev', ctypes.c_ulonglong),
        ('st_size', ctypes.c_longlong),
        ('st_blksize', ctypes.c_longlong),  # the field this whole hunt was about
        ('st_blocks', ctypes.c_longlong),
        ('st_atim', ctypes.c_longlong),
        ('st_atimensec', ctypes.c_ulong),
        ('st_mtim', ctypes.c_longlong),
        ('st_mtimensec', ctypes.c_ulong),
        ('st_ctim', ctypes.c_longlong),
        ('st_ctimensec', ctypes.c_ulong),
        ('__glibc_reserved', ctypes.c_long * 3)
    ]
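A hand-written layout like this is easy to get subtly wrong, so here is a small sketch of how one might sanity-check it. The helper field_layout and the three-field Demo struct are both made up for this example; the helper only reads the offset and size that ctypes itself assigns to each field.

```python
import ctypes


def field_layout(struct_cls):
    """Return (name, offset, size) for each field of a ctypes.Structure subclass."""
    layout = []
    for field in struct_cls._fields_:
        name = field[0]
        descriptor = getattr(struct_cls, name)  # ctypes field descriptor
        layout.append((name, descriptor.offset, descriptor.size))
    return layout


class Demo(ctypes.Structure):
    # Hypothetical struct, only here to exercise the helper.
    _fields_ = [
        ("a", ctypes.c_uint),
        ("b", ctypes.c_uint),
        ("c", ctypes.c_ulonglong),
    ]


for name, offset, size in field_layout(Demo):
    print(f"{name:>4} offset={offset:<3} size={size}")
print("total:", ctypes.sizeof(Demo))
```

Pointed at the stat class above on x86-64 Linux, I would expect a total of 144 bytes with st_blksize at offset 56, if I added it up right. That can then be compared against what a C compiler says about sizeof and offsetof.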
I could have just compiled something like printf("%d\n", sizeof(stat::st_ino)); (g++) etc. and experimentally determined signedness. But I guess I just wanted to do it the hard way. One of those days.
So one file would point me to another, and the other back. Like nobody wanted anything to do with me. For example: #struct_stat.h will tell you that stat::st_ino is of type __ino_t. Then e.g. find /usr/include/ -iname '*.h' | xargs grep __ino_t will tell you that if you looked in #types.h you would see that __ino_t is of type __INO_T_TYPE. Then e.g. gtags -C /usr/include/ /tmp/tags/ && global -C /tmp/tags/ __INO_T_TYPE gnu gh would refer you kindly to #typesizes.h. There you'll see that __INO_T_TYPE is a __SYSCALL_ULONG_TYPE, which is either __UQUAD_TYPE or __ULONGWORD_TYPE depending on compile-time flags ibm. This would send you back to #types.h to tell you that __UQUAD_TYPE is an __uint64_t, and so on.
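The chase can be replayed by hand as a lookup table. This is only a sketch: the TYPEDEFS entries are copied from the chain described above, picking the __UQUAD_TYPE branch is an assumption (the other branch depends on compile-time flags), and the names TYPEDEFS, CTYPES and resolve are made up here. Nothing in it parses real headers.

```python
import ctypes

# Typedef chain as described above, __UQUAD_TYPE branch assumed.
TYPEDEFS = {
    "__ino_t": "__INO_T_TYPE",
    "__INO_T_TYPE": "__SYSCALL_ULONG_TYPE",
    "__SYSCALL_ULONG_TYPE": "__UQUAD_TYPE",
    "__UQUAD_TYPE": "__uint64_t",
}

# Where the chase ends: a fixed-width C type with a ctypes counterpart.
CTYPES = {"__uint64_t": ctypes.c_uint64}


def resolve(name):
    """Follow TYPEDEFS until a name with a known ctypes equivalent turns up."""
    chain = [name]
    while name not in CTYPES:
        name = TYPEDEFS[name]  # a KeyError here means the table is incomplete
        chain.append(name)
    return chain, CTYPES[name]


chain, ctype = resolve("__ino_t")
print(" -> ".join(chain))
print(ctypes.sizeof(ctype), "bytes")
```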
It was only later that I stumbled over _pyio.open and saw that st_blksize was indeed a thing in Python's stat gh.
Compare and contrast:
$ python -c 'import os; print(os.stat("/tmp/tags/GTAGS"))'
os.stat_result(st_mode=33188, st_ino=###, st_dev=##, st_nlink=#, st_uid=#####,
st_gid=#####, st_size=52051968, st_atime=1692801049, st_mtime=1692798675,
st_ctime=1692798675)
$ python -c 'import os; print(os.stat("/tmp/tags/GTAGS").st_blksize)'
4096
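Which brings me back to the original task. Here is a minimal sketch of a chunked file comparison that takes its chunk size from st_blksize. The function name files_equal and the fallback_size default of 8192 are made up for this example, and I haven't measured whether this chunk size is actually the fastest choice.

```python
import os


def files_equal(path_a, path_b, fallback_size=8192):
    """Hypothetical helper: compare two regular files chunk by chunk."""
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)
    if stat_a.st_size != stat_b.st_size:
        return False  # files of different sizes can never be equal
    # st_blksize is missing on some platforms (e.g. Windows), hence getattr.
    chunk = getattr(stat_a, "st_blksize", 0)
    if chunk <= 1:
        chunk = fallback_size
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(chunk)
            block_b = fb.read(chunk)
            if block_a != block_b:
                return False
            if not block_a:  # both files reached EOF together
                return True
```

The standard library's filecmp.cmp(a, b, shallow=False) does a similar byte-by-byte comparison, so this is mostly useful for seeing where the block size would plug in.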