io_uring - part I
Inspired by https://notes.eatonphil.com/2023-10-19-write-file-to-disk-with-io_uring.html, I decided to dig deeper into io_uring, as it's a modern Linux API for I/O use cases.
- Personally, I wanted to explore it for databases, i.e. disk I/O
What is io_uring
tl;dr: There are 2 ring buffers: one for submitting I/O requests to the kernel (possibly in batches), the other for receiving the results from the kernel. Both rings are shared with the kernel, which reads submissions and writes completions directly. This reduces the number of syscalls and the amount of copying between user space and the kernel.
https://unixism.net/loti/what_is_io_uring.html
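From user space, the whole interface boils down to three syscalls: io_uring_setup, io_uring_enter and io_uring_register. A quick way to see them in action (a suggestion I haven't run for this post; it assumes an strace new enough to know these syscall names, which the one used in the appendix clearly is):
# Trace only the io_uring syscalls of a small test run
$ strace -f -e trace=io_uring_setup,io_uring_enter,io_uring_register fio --name=demo --ioengine=io_uring --rw=write --size=16M --bs=4k --end_fsync=1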
Using fio to improve my understanding
Like https://notes.eatonphil.com/2023-10-19-write-file-to-disk-with-io_uring.html, I tried to write my own tests using giouring and iouring-go to benchmark writes.
It didn't seem like there was much difference in writes. At this point, there's a possibility that my tests or the libraries used were not written optimally. So I decided to try out fio.
fio is written by Jens Axboe, who also wrote io_uring, so it should be the best possible representation.
# My kernel version
Linux 6.8.0-79-generic
$ fio --version
fio-3.28
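As a sanity check that this fio build has the io_uring engine compiled in (a suggestion, not something I needed for these runs):
# List the available ioengines; io_uring should appear in the list
$ fio --enghelp
# Show the io_uring engine's specific options
$ fio --enghelp=io_uring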
Comparing sequential writes
# sync
# IOPS=94.1k, BW=370MiB/s
fio --name=seq-write-test-sync --ioengine=sync --rw=write --size=1G --bs=4k --end_fsync=1
# io_uring 1 entry, submit 1 at a time, complete 1 at a time
# IOPS=66.9k, BW=261MiB/s
fio --name=seq-write-test-io-uring --ioengine=io_uring --rw=write --size=1G --bs=4k --end_fsync=1
# io_uring 5 entries, submit 5 at a time, complete 1 - 5
# IOPS=94.0k, BW=367MiB/s
fio --name=seq-write-test-io-uring-batch --ioengine=io_uring --rw=write --size=1G --bs=4k --iodepth=5 --iodepth_batch_submit=5 --iodepth_batch_complete_min=1 --iodepth_batch_complete_max=5 --end_fsync=1
Comparing random writes
# sync
# IOPS=92.4k, BW=361MiB/s
fio --name=random-write-test-sync --ioengine=sync --rw=randwrite --size=1G --bs=4k --end_fsync=1
# io_uring 1 entry, submit 1 at a time, complete 1 at a time
# IOPS=65.8k, BW=257MiB/s
fio --name=random-write-test-io-uring --ioengine=io_uring --rw=randwrite --size=1G --bs=4k --end_fsync=1
# io_uring 5 entries, submit 5 at a time, complete 1 - 5
# IOPS=92.9k, BW=363MiB/s
fio --name=random-write-test-io-uring-batch --ioengine=io_uring --rw=randwrite --size=1G --bs=4k --iodepth=5 --iodepth_batch_submit=5 --iodepth_batch_complete_min=1 --iodepth_batch_complete_max=5 --end_fsync=1
It’s interesting to note that:
- The basic io_uring configuration (1 entry, submit 1 at a time, complete 1 at a time) was worse than sync
- The batching io_uring configuration is about on par with sync
- The above 2 points apply to both sequential and random writes
- Sequential writes and random writes perform similarly
- This goes against my understanding that SSD random writes require more work (i.e. erase + write) than sequential writes. There could be an optimisation down the stack; one way to check would be to bypass the page cache with direct I/O, as sketched below.
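One untested way to check this would be to take the page cache out of the picture with direct I/O, so every write actually has to reach the disk. The --direct=1 flag is standard fio; the job names here are just placeholders:
# Untested suggestion: bypass the page cache
fio --name=seq-write-direct --ioengine=io_uring --rw=write --size=1G --bs=4k --direct=1 --end_fsync=1
fio --name=random-write-direct --ioengine=io_uring --rw=randwrite --size=1G --bs=4k --direct=1 --end_fsync=1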
Comparing sequential reads
# create file to test
fio --name=create_file --rw=write --bs=4k --filename=test_file.dat --size=2G --end_fsync=1
# sync
# IOPS=99.3k, BW=388MiB/s
fio --name=seq-read-test-sync --rw=read --bs=4k --filename=test_file.dat --size=2G --ioengine=sync
# io_uring 1 entry, submit 1 at a time, complete 1 at a time
# IOPS=97.7k, BW=381MiB/s
fio --name=seq-read-test-io-uring --rw=read --bs=4k --filename=test_file.dat --size=2G --ioengine=io_uring
# io_uring 5 entries, submit 5 at a time, complete 1 - 5
# IOPS=97.8k, BW=382MiB/s
fio --name=seq-read-test-io-uring-batch --rw=read --bs=4k --filename=test_file.dat --size=2G --ioengine=io_uring --iodepth=5 --iodepth_batch_submit=5 --iodepth_batch_complete_min=1 --iodepth_batch_complete_max=5
Comparing random reads
# sync
# IOPS=9284, BW=36.3MiB/s
fio --name=random-read-test-sync --rw=randread --bs=4k --filename=test_file.dat --size=2G --ioengine=sync
# io_uring 1 entry, submit 1 at a time, complete 1 at a time
# IOPS=9302, BW=36.3MiB/s
fio --name=random-read-test-io-uring --rw=randread --bs=4k --filename=test_file.dat --size=2G --ioengine=io_uring
# io_uring 5 entries, submit 5 at a time, complete 1 - 5
# IOPS=42.4k, BW=166MiB/s
fio --name=random-read-test-io-uring-batch --rw=randread --bs=4k --filename=test_file.dat --size=2G --ioengine=io_uring --iodepth=5 --iodepth_batch_submit=5 --iodepth_batch_complete_min=1 --iodepth_batch_complete_max=5
For sequential reads,
- sync, basic io_uring and batching io_uring perform similarly
- sync performs slightly better than io_uring
For random reads,
- The basic io_uring configuration is about on par with sync
- The batching io_uring configuration outperforms sync by close to 5x
Why is the write performance similar
I tried to strace the write tests. Sadly, strace slows down the sync test significantly, most likely because of the per-syscall overhead multiplied by the number of syscalls. See the write strace(s) in the appendix below for the traces.
- sync makes roughly 530k syscalls, basic io_uring about 360k, and batched io_uring only about 110k
My guess is that writes are fast because they mostly don't hit the disk right away (i.e. the page cache buffers the writes and flushes them to disk later). Because of this, the benefit of io_uring is less apparent.
Compare this with reads, where we really make the disk work hard, as seen by the lower IOPS. In such cases, the async nature of io_uring provides a much more obvious benefit (i.e. we don't need to block while data is being fetched from disk).
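If the page-cache explanation is right, the amount of dirty memory should climb during the write test and only drain around the final fsync. A simple way to watch this while a benchmark runs (a suggestion; I didn't capture this for the numbers above):
# Watch the dirty/writeback page-cache counters during a write test
$ watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'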
Why is the sequential read performance similar
So I found out that the kernel merges adjacent I/O requests before sending them to the device, to improve performance.
Even though we're supposedly fetching 4KB blocks from the disk each time, the kernel actually accumulates these requests and issues larger requests to the disk.
To support this idea, I tried running with a block size of 8KB:
# 4KB blocks
# IOPS=99.3k, BW=388MiB/s
fio --name=seq-read-test-sync --rw=read --bs=4k --filename=test_file.dat --size=2G --ioengine=sync
# 8KB blocks
# IOPS=48.7k, BW=380MiB/s
fio --name=seq-read-test-sync --rw=read --bs=8k --filename=test_file.dat --size=2G --ioengine=sync
IOPS is halved, as expected, but the bandwidth is about the same. If the I/O requests were not being merged, I would expect a significant difference in bandwidth between the two.
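Merging can also be observed more directly. iostat reports merged requests per second (rrqm/s and wrqm/s), and the block layer exposes a nomerges knob that can disable merging entirely. I haven't re-run the benchmarks with these; they're follow-up ideas:
# Watch merged requests per second (rrqm/s, wrqm/s) while a test runs
$ iostat -x 1 sda
# Disable request merging for the device (0 = allow merging, 2 = disable all merging)
$ echo 2 | sudo tee /sys/block/sda/queue/nomerges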
With this in mind, we now know that we're not actually hitting the disk as often as we expected. Hence, the async nature of io_uring doesn't really get a chance to shine.
Sidenote: My PC uses an SSD with the mq-deadline I/O scheduler. The article I read about merging I/O requests was in the context of HDDs; I would assume the concept applies to SSDs as well.
$ cat /sys/block/sda/queue/scheduler
none [mq-deadline]
Thoughts
From these tests, it seems like naive usage of io_uring only brings big benefits to random read workloads, and not much to sequential reads or writes.
Though, there are still some unexplored points:
- Whether sync and io_uring perform similarly for sequential reads when reading multiple files concurrently
- Given what we know about merging, if the files are in different blocks, the I/O requests may not coalesce. This would result in more actual disk reads, and io_uring could help a lot here (a possible test is sketched below)
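A possible way to test this with fio would be to spread each job's I/O across several files, so that "sequential" streams from different files interleave at the device. This is an untested sketch; --nrfiles splits a job's size across that many files:
# Untested sketch: sequential reads spread across 8 files, sync vs io_uring
fio --name=multi-file-read-sync --ioengine=sync --rw=read --bs=4k --size=2G --nrfiles=8
fio --name=multi-file-read-io-uring-batch --ioengine=io_uring --rw=read --bs=4k --size=2G --nrfiles=8 --iodepth=5 --iodepth_batch_submit=5 --iodepth_batch_complete_min=1 --iodepth_batch_complete_max=5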
What’s next
I will experiment with adding io_uring to an open source Go-based database, with a focus on the read path. I will follow up with another post.
Appendix: write strace(s)
$ strace --summary-only -f fio --name=random-write-test-sync --ioengine=sync --rw=randwrite --size=1G --bs=4k --end_fsync=1
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
38.27 2.820980 55313 51 pselect6
37.87 2.791422 3125 893 wait4
15.92 1.173743 4 262171 write
5.30 0.390505 1 261883 lseek
2.25 0.165909 165909 1 fsync
0.11 0.008233 2 3646 915 newfstatat
0.10 0.007042 7 903 1 clock_nanosleep
0.07 0.004814 100 48 futex
0.03 0.001904 3 588 31 openat
0.02 0.001223 1 860 getdents64
0.01 0.001100 91 12 sched_setaffinity
0.01 0.000799 1 560 close
0.01 0.000636 2 228 51 read
0.01 0.000618 2 269 mmap
0.01 0.000442 442 1 fallocate
0.00 0.000357 357 1 clone
0.00 0.000271 3 82 mprotect
0.00 0.000271 20 13 clone3
0.00 0.000262 6 42 munmap
0.00 0.000195 3 53 rt_sigprocmask
0.00 0.000173 173 1 shmdt
0.00 0.000132 2 50 ioctl
0.00 0.000098 7 13 madvise
0.00 0.000098 1 51 timerfd_settime
0.00 0.000065 4 15 set_robust_list
0.00 0.000051 4 12 gettid
0.00 0.000049 3 14 rseq
0.00 0.000028 3 9 brk
0.00 0.000026 13 2 getrusage
0.00 0.000025 8 3 fadvise64
0.00 0.000014 7 2 getpriority
0.00 0.000010 5 2 2 statfs
0.00 0.000008 8 1 setpriority
0.00 0.000006 6 1 shmget
0.00 0.000006 6 1 shmat
0.00 0.000005 1 4 fchdir
0.00 0.000004 2 2 sysinfo
0.00 0.000004 1 3 sched_getaffinity
0.00 0.000003 1 2 2 access
0.00 0.000003 1 2 1 shmctl
0.00 0.000002 2 1 set_tid_address
0.00 0.000002 2 1 getrandom
0.00 0.000001 0 4 rt_sigaction
0.00 0.000001 0 3 getpid
0.00 0.000001 1 1 fcntl
0.00 0.000001 0 2 1 arch_prctl
0.00 0.000001 1 1 restart_syscall
0.00 0.000001 1 1 prlimit64
0.00 0.000000 0 4 pread64
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 1 unlink
0.00 0.000000 0 1 setsid
0.00 0.000000 0 1 timerfd_create
0.00 0.000000 0 1 pipe2
------ ----------- ----------- --------- --------- ------------------
100.00 7.371544 13 532518 1005 total
$ strace --summary-only -f fio --name=random-write-test-io-uring --ioengine=io_uring --rw=randwrite --size=1G --bs=4k --size=1G --end_fsync=1
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
41.95 3.138922 84835 37 pselect6
41.64 3.115477 4771 653 wait4
14.14 1.058010 3 350742 io_uring_enter
1.95 0.145944 72972 2 fadvise64
0.09 0.007008 2 3396 673 newfstatat
0.09 0.006491 113 57 3 futex
0.06 0.004437 6 663 1 clock_nanosleep
0.02 0.001335 2 577 31 openat
0.01 0.001024 1 860 getdents64
0.01 0.000607 1 550 close
0.01 0.000423 1 272 mmap
0.00 0.000357 1 190 37 read
0.00 0.000350 350 1 clone
0.00 0.000318 7 45 munmap
0.00 0.000249 19 13 clone3
0.00 0.000211 2 82 mprotect
0.00 0.000208 17 12 sched_setaffinity
0.00 0.000167 167 1 shmdt
0.00 0.000160 3 53 rt_sigprocmask
0.00 0.000073 2 26 write
0.00 0.000071 5 12 gettid
0.00 0.000067 1 36 ioctl
0.00 0.000067 1 37 timerfd_settime
0.00 0.000060 4 14 rseq
0.00 0.000053 3 15 set_robust_list
0.00 0.000042 42 1 restart_syscall
0.00 0.000018 2 9 brk
0.00 0.000015 1 13 madvise
0.00 0.000009 9 1 shmat
0.00 0.000009 4 2 2 statfs
0.00 0.000007 7 1 shmget
0.00 0.000005 2 2 sysinfo
0.00 0.000003 1 2 2 access
0.00 0.000003 1 2 1 shmctl
0.00 0.000003 1 3 getpid
0.00 0.000003 1 3 sched_getaffinity
0.00 0.000003 3 1 getrandom
0.00 0.000002 1 2 getrusage
0.00 0.000002 1 2 1 arch_prctl
0.00 0.000002 2 1 set_tid_address
0.00 0.000002 2 1 prlimit64
0.00 0.000001 1 1 lseek
0.00 0.000000 0 4 rt_sigaction
0.00 0.000000 0 4 pread64
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 fcntl
0.00 0.000000 0 4 fchdir
0.00 0.000000 0 1 setsid
0.00 0.000000 0 2 getpriority
0.00 0.000000 0 1 setpriority
0.00 0.000000 0 1 timerfd_create
0.00 0.000000 0 1 pipe2
0.00 0.000000 0 1 io_uring_setup
0.00 0.000000 0 1 io_uring_register
------ ----------- ----------- --------- --------- ------------------
100.00 7.482218 20 358415 751 total
$ strace --summary-only -f fio --name=random-write-test-io-uring-batch --ioengine=io_uring --rw=randwrite --size=1G --bs=4k --size=1G --iodepth=5 --iodepth_batch_submit=5 --iodepth_batch_complete_min=1 --iodepth_batch_complete_max=5 --end_fsync=1
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
39.95 1.824118 96006 19 pselect6
39.57 1.806977 4793 377 wait4
17.16 0.783407 7 104814 io_uring_enter
2.91 0.132968 66484 2 fadvise64
0.13 0.005748 1 3109 397 newfstatat
0.07 0.003330 65 51 futex
0.05 0.002495 6 387 1 clock_nanosleep
0.04 0.001942 36 53 rt_sigprocmask
0.03 0.001274 2 566 31 openat
0.02 0.000925 1 860 getdents64
0.02 0.000900 3 272 mmap
0.01 0.000560 1 539 close
0.01 0.000317 7 45 munmap
0.01 0.000310 310 1 clone
0.01 0.000308 2 143 19 read
0.01 0.000259 3 82 mprotect
0.00 0.000153 153 1 shmdt
0.00 0.000142 10 13 madvise
0.00 0.000063 2 22 write
0.00 0.000028 1 18 ioctl
0.00 0.000028 2 13 clone3
0.00 0.000022 1 19 timerfd_settime
0.00 0.000014 14 1 lseek
0.00 0.000014 7 2 getrusage
0.00 0.000014 7 2 getpriority
0.00 0.000013 13 1 restart_syscall
0.00 0.000011 0 12 sched_setaffinity
0.00 0.000008 8 1 shmat
0.00 0.000008 8 1 setpriority
0.00 0.000007 7 1 shmget
0.00 0.000006 3 2 2 statfs
0.00 0.000006 6 1 pipe2
0.00 0.000006 0 14 rseq
0.00 0.000005 2 2 sysinfo
0.00 0.000005 0 15 set_robust_list
0.00 0.000005 5 1 timerfd_create
0.00 0.000004 2 2 1 shmctl
0.00 0.000004 1 3 getpid
0.00 0.000004 0 12 gettid
0.00 0.000004 1 3 sched_getaffinity
0.00 0.000003 0 9 brk
0.00 0.000003 0 4 rt_sigaction
0.00 0.000003 1 2 2 access
0.00 0.000002 2 1 fcntl
0.00 0.000001 0 2 1 arch_prctl
0.00 0.000001 1 1 set_tid_address
0.00 0.000001 1 1 prlimit64
0.00 0.000001 1 1 getrandom
0.00 0.000000 0 4 pread64
0.00 0.000000 0 1 execve
0.00 0.000000 0 4 fchdir
0.00 0.000000 0 1 setsid
0.00 0.000000 0 1 io_uring_setup
0.00 0.000000 0 1 io_uring_register
------ ----------- ----------- --------- --------- ------------------
100.00 4.566427 40 111515 454 total