io_uring - part I
Inspired by https://notes.eatonphil.com/2023-10-19-write-file-to-disk-with-io_uring.html, I decided to dig deeper into io_uring, as it's a modern Linux API for I/O use cases.
- Personally, I wanted to explore it for databases, i.e. disk I/O
What is io_uring
tl;dr: There are 2 ring buffers: one for submitting I/O requests to the kernel (possibly in batches), the other for receiving the results from the kernel. Both rings are shared with the kernel, which reads submissions and writes completions directly. This reduces the number of syscalls and the amount of copying between user space and the kernel.
https://unixism.net/loti/what_is_io_uring.html
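From user space, the whole interface boils down to three syscalls: io_uring_setup, io_uring_enter and io_uring_register. A quick way to see them in action (a suggestion I haven't run for this post; it assumes an strace new enough to know these syscall names, which the one used in the appendix clearly is):
# Trace only the io_uring syscalls of a small test run
$ strace -f -e trace=io_uring_setup,io_uring_enter,io_uring_register fio --name=demo --ioengine=io_uring --rw=write --size=16M --bs=4k --end_fsync=1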
Using fio to improve my understanding
Like https://notes.eatonphil.com/2023-10-19-write-file-to-disk-with-io_uring.html, I tried to write my own tests using giouring and iouring-go to benchmark writes.
It didn't seem like there was much difference in writes. At this point, there's a possibility that my tests or the libraries used were not written optimally. So I decided to try out fio.
fio is written by Jens Axboe, who also wrote io_uring, so it should be the best possible representation.
# My kernel version
Linux 6.8.0-79-generic
$ fio --version
fio-3.28
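As a sanity check that this fio build has the io_uring engine compiled in (a suggestion, not something I needed for these runs):
# List the available ioengines; io_uring should appear in the list
$ fio --enghelp
# Show the io_uring engine's specific options
$ fio --enghelp=io_uring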
Comparing sequential writes
# sync
# IOPS=94.1k, BW=370MiB/s
fio --name=seq-write-test-sync --ioengine=sync --rw=write --size=1G --bs=4k --end_fsync=1
# io_uring 1 entry, submit 1 at a time, complete 1 at a time
# IOPS=66.9k, BW=261MiB/s
fio --name=seq-write-test-io-uring --ioengine=io_uring --rw=write --size=1G --bs=4k --end_fsync=1
# io_uring 5 entries, submit 5 at a time, complete 1 - 5
# IOPS=94.0k, BW=367MiB/s
fio --name=seq-write-test-io-uring-batch --ioengine=io_uring --rw=write --size=1G --bs=4k --iodepth=5 --iodepth_batch_submit=5 --iodepth_batch_complete_min=1 --iodepth_batch_complete_max=5 --end_fsync=1
Comparing random writes
# sync
# IOPS=92.4k, BW=361MiB/s
fio --name=random-write-test-sync --ioengine=sync --rw=randwrite --size=1G --bs=4k --end_fsync=1
# io_uring 1 entry, submit 1 at a time, complete 1 at a time
# IOPS=65.8k, BW=257MiB/s
fio --name=random-write-test-io-uring --ioengine=io_uring --rw=randwrite --size=1G --bs=4k --end_fsync=1
# io_uring 5 entries, submit 5 at a time, complete 1 - 5
# IOPS=92.9k, BW=363MiB/s
fio --name=random-write-test-io-uring-batch --ioengine=io_uring --rw=randwrite --size=1G --bs=4k --iodepth=5 --iodepth_batch_submit=5 --iodepth_batch_complete_min=1 --iodepth_batch_complete_max=5 --end_fsync=1
It’s interesting to note that:
- The basic io_uring configuration (1 entry, submit 1 at a time, complete 1 at a time) was worse than sync
- The batching io_uring configuration is about on par with sync
- The above 2 points apply to both sequential and random writes
- Sequential writes and random writes perform similarly
- This goes against my understanding that SSD random writes require more work (i.e. erase + write) than sequential writes. There could be an optimisation down the stack; one way to check would be to bypass the page cache with direct I/O, as sketched below.
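One untested way to check this would be to take the page cache out of the picture with direct I/O, so every write actually has to reach the disk. The --direct=1 flag is standard fio; the job names here are just placeholders:
# Untested suggestion: bypass the page cache
fio --name=seq-write-direct --ioengine=io_uring --rw=write --size=1G --bs=4k --direct=1 --end_fsync=1
fio --name=random-write-direct --ioengine=io_uring --rw=randwrite --size=1G --bs=4k --direct=1 --end_fsync=1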
Comparing sequential reads
# create file to test
fio --name=create_file --rw=write --bs=4k --filename=test_file.dat --size=2G --end_fsync=1
# sync
# IOPS=99.3k, BW=388MiB/s
fio --name=seq-read-test-sync --rw=read --bs=4k --filename=test_file.dat --size=2G --ioengine=sync
# io_uring 1 entry, submit 1 at a time, complete 1 at a time
# IOPS=97.7k, BW=381MiB/s
fio --name=seq-read-test-io-uring --rw=read --bs=4k --filename=test_file.dat --size=2G --ioengine=io_uring
# io_uring 5 entries, submit 5 at a time, complete 1 - 5
# IOPS=97.8k, BW=382MiB/s
fio --name=seq-read-test-io-uring-batch --rw=read --bs=4k --filename=test_file.dat --size=2G --ioengine=io_uring --iodepth=5 --iodepth_batch_submit=5 --iodepth_batch_complete_min=1 --iodepth_batch_complete_max=5
Comparing random reads
# sync
# IOPS=9284, BW=36.3MiB/s
fio --name=random-read-test-sync --rw=randread --bs=4k --filename=test_file.dat --size=2G --ioengine=sync
# io_uring 1 entry, submit 1 at a time, complete 1 at a time
# IOPS=9302, BW=36.3MiB/s
fio --name=random-read-test-io-uring --rw=randread --bs=4k --filename=test_file.dat --size=2G --ioengine=io_uring
# io_uring 5 entries, submit 5 at a time, complete 1 - 5
# IOPS=42.4k, BW=166MiB/s
fio --name=random-read-test-io-uring-batch --rw=randread --bs=4k --filename=test_file.dat --size=2G --ioengine=io_uring --iodepth=5 --iodepth_batch_submit=5 --iodepth_batch_complete_min=1 --iodepth_batch_complete_max=5
For sequential reads,
- sync, basic io_uring and batching io_uring perform similarly
- sync performs slightly better than io_uring
For random reads,
- The basic io_uring configuration is about on par with sync
- The batching io_uring configuration outperforms sync by close to 5x
Why is the write performance similar
I tried to strace the write tests. Sadly, strace slows down the sync test significantly, most likely because of the per-syscall overhead multiplied by the number of syscalls. See the write strace(s) in the appendix below for the traces.
- sync makes roughly 530k syscalls, basic io_uring about 360k, and batched io_uring only about 110k
My guess is that writes are fast because they mostly don't hit the disk right away (i.e. the page cache buffers the writes and flushes them to disk later). Because of this, the benefit of io_uring is less apparent.
Compare this with reads, where we really make the disk work hard, as seen by the lower IOPS. In such cases, the async nature of io_uring provides a much more obvious benefit (i.e. we don't need to block while data is being fetched from disk).
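If the page-cache explanation is right, the amount of dirty memory should climb during the write test and only drain around the final fsync. A simple way to watch this while a benchmark runs (a suggestion; I didn't capture this for the numbers above):
# Watch the dirty/writeback page-cache counters during a write test
$ watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'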
Why is the sequential read performance similar
So I found out that the kernel merges adjacent I/O requests before sending them to the device, to improve performance.
Even though we're supposedly fetching 4KB blocks from the disk each time, the kernel actually accumulates these requests and issues larger requests to the disk.
To support this idea, I tried running with a block size of 8KB:
# 4KB blocks
# IOPS=99.3k, BW=388MiB/s
fio --name=seq-read-test-sync --rw=read --bs=4k --filename=test_file.dat --size=2G --ioengine=sync
# 8KB blocks
# IOPS=48.7k, BW=380MiB/s
fio --name=seq-read-test-sync --rw=read --bs=8k --filename=test_file.dat --size=2G --ioengine=sync
IOPS is halved, as expected, but the bandwidth is about the same. If the I/O requests were not being merged, I would expect a significant difference in bandwidth between the two.
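Merging can also be observed more directly. iostat reports merged requests per second (rrqm/s and wrqm/s), and the block layer exposes a nomerges knob that can disable merging entirely. I haven't re-run the benchmarks with these; they're follow-up ideas:
# Watch merged requests per second (rrqm/s, wrqm/s) while a test runs
$ iostat -x 1 sda
# Disable request merging for the device (0 = allow merging, 2 = disable all merging)
$ echo 2 | sudo tee /sys/block/sda/queue/nomerges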
With this in mind, we now know that we're not actually hitting the disk as often as we expected. Hence, the async nature of io_uring doesn't really get a chance to shine.
Sidenote: My PC uses an SSD with the mq-deadline I/O scheduler. The article I read about merging I/O requests was in the context of HDDs; I would assume the concept applies to SSDs as well.
$ cat /sys/block/sda/queue/scheduler
none [mq-deadline]
Thoughts
From these tests, it seems like naive usage of io_uring only brings big benefits to random read workloads, and not much to sequential reads or writes.
Though, there are still some unexplored points:
- Whether sync and io_uring perform similarly for sequential reads when reading multiple files concurrently
- Given what we know about merging, if the files are in different blocks, the I/O requests may not coalesce. This would result in more actual disk reads, and io_uring could help a lot here (a possible test is sketched below)
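A possible way to test this with fio would be to spread each job's I/O across several files, so that "sequential" streams from different files interleave at the device. This is an untested sketch; --nrfiles splits a job's size across that many files:
# Untested sketch: sequential reads spread across 8 files, sync vs io_uring
fio --name=multi-file-read-sync --ioengine=sync --rw=read --bs=4k --size=2G --nrfiles=8
fio --name=multi-file-read-io-uring-batch --ioengine=io_uring --rw=read --bs=4k --size=2G --nrfiles=8 --iodepth=5 --iodepth_batch_submit=5 --iodepth_batch_complete_min=1 --iodepth_batch_complete_max=5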
What’s next
I will experiment with adding io_uring to an open source Go-based database, with a focus on the read path. I will follow up with another post.
Appendix: write strace(s)
$ strace --summary-only -f fio --name=random-write-test-sync --ioengine=sync --rw=randwrite --size=1G --bs=4k --end_fsync=1
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
38.27 2.820980 55313 51 pselect6
37.87 2.791422 3125 893 wait4
15.92 1.173743 4 262171 write
5.30 0.390505 1 261883 lseek
2.25 0.165909 165909 1 fsync
0.11 0.008233 2 3646 915 newfstatat
0.10 0.007042 7 903 1 clock_nanosleep
0.07 0.004814 100 48 futex
0.03 0.001904 3 588 31 openat
0.02 0.001223 1 860 getdents64
0.01 0.001100 91 12 sched_setaffinity
0.01 0.000799 1 560 close
0.01 0.000636 2 228 51 read
0.01 0.000618 2 269 mmap
0.01 0.000442 442 1 fallocate
0.00 0.000357 357 1 clone
0.00 0.000271 3 82 mprotect
0.00 0.000271 20 13 clone3
0.00 0.000262 6 42 munmap
0.00 0.000195 3 53 rt_sigprocmask
0.00 0.000173 173 1 shmdt
0.00 0.000132 2 50 ioctl
0.00 0.000098 7 13 madvise
0.00 0.000098 1 51 timerfd_settime
0.00 0.000065 4 15 set_robust_list
0.00 0.000051 4 12 gettid
0.00 0.000049 3 14 rseq
0.00 0.000028 3 9 brk
0.00 0.000026 13 2 getrusage
0.00 0.000025 8 3 fadvise64
0.00 0.000014 7 2 getpriority
0.00 0.000010 5 2 2 statfs
0.00 0.000008 8 1 setpriority
0.00 0.000006 6 1 shmget
0.00 0.000006 6 1 shmat
0.00 0.000005 1 4 fchdir
0.00 0.000004 2 2 sysinfo
0.00 0.000004 1 3 sched_getaffinity
0.00 0.000003 1 2 2 access
0.00 0.000003 1 2 1 shmctl
0.00 0.000002 2 1 set_tid_address
0.00 0.000002 2 1 getrandom
0.00 0.000001 0 4 rt_sigaction
0.00 0.000001 0 3 getpid
0.00 0.000001 1 1 fcntl
0.00 0.000001 0 2 1 arch_prctl
0.00 0.000001 1 1 restart_syscall
0.00 0.000001 1 1 prlimit64
0.00 0.000000 0 4 pread64
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 1 unlink
0.00 0.000000 0 1 setsid
0.00 0.000000 0 1 timerfd_create
0.00 0.000000 0 1 pipe2
------ ----------- ----------- --------- --------- ------------------
100.00 7.371544 13 532518 1005 total
$ strace --summary-only -f fio --name=random-write-test-io-uring --ioengine=io_uring --rw=randwrite --size=1G --bs=4k --size=1G --end_fsync=1
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
41.95 3.138922 84835 37 pselect6
41.64 3.115477 4771 653 wait4
14.14 1.058010 3 350742 io_uring_enter
1.95 0.145944 72972 2 fadvise64
0.09 0.007008 2 3396 673 newfstatat
0.09 0.006491 113 57 3 futex
0.06 0.004437 6 663 1 clock_nanosleep
0.02 0.001335 2 577 31 openat
0.01 0.001024 1 860 getdents64
0.01 0.000607 1 550 close
0.01 0.000423 1 272 mmap
0.00 0.000357 1 190 37 read
0.00 0.000350 350 1 clone
0.00 0.000318 7 45 munmap
0.00 0.000249 19 13 clone3
0.00 0.000211 2 82 mprotect
0.00 0.000208 17 12 sched_setaffinity
0.00 0.000167 167 1 shmdt
0.00 0.000160 3 53 rt_sigprocmask
0.00 0.000073 2 26 write
0.00 0.000071 5 12 gettid
0.00 0.000067 1 36 ioctl
0.00 0.000067 1 37 timerfd_settime
0.00 0.000060 4 14 rseq
0.00 0.000053 3 15 set_robust_list
0.00 0.000042 42 1 restart_syscall
0.00 0.000018 2 9 brk
0.00 0.000015 1 13 madvise
0.00 0.000009 9 1 shmat
0.00 0.000009 4 2 2 statfs
0.00 0.000007 7 1 shmget
0.00 0.000005 2 2 sysinfo
0.00 0.000003 1 2 2 access
0.00 0.000003 1 2 1 shmctl
0.00 0.000003 1 3 getpid
0.00 0.000003 1 3 sched_getaffinity
0.00 0.000003 3 1 getrandom
0.00 0.000002 1 2 getrusage
0.00 0.000002 1 2 1 arch_prctl
0.00 0.000002 2 1 set_tid_address
0.00 0.000002 2 1 prlimit64
0.00 0.000001 1 1 lseek
0.00 0.000000 0 4 rt_sigaction
0.00 0.000000 0 4 pread64
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 fcntl
0.00 0.000000 0 4 fchdir
0.00 0.000000 0 1 setsid
0.00 0.000000 0 2 getpriority
0.00 0.000000 0 1 setpriority
0.00 0.000000 0 1 timerfd_create
0.00 0.000000 0 1 pipe2
0.00 0.000000 0 1 io_uring_setup
0.00 0.000000 0 1 io_uring_register
------ ----------- ----------- --------- --------- ------------------
100.00 7.482218 20 358415 751 total
$ strace --summary-only -f fio --name=random-write-test-io-uring-batch --ioengine=io_uring --rw=randwrite --size=1G --bs=4k --size=1G --iodepth=5 --iodepth_batch_submit=5 --iodepth_batch_complete_min=1 --iodepth_batch_complete_max=5 --end_fsync=1
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
39.95 1.824118 96006 19 pselect6
39.57 1.806977 4793 377 wait4
17.16 0.783407 7 104814 io_uring_enter
2.91 0.132968 66484 2 fadvise64
0.13 0.005748 1 3109 397 newfstatat
0.07 0.003330 65 51 futex
0.05 0.002495 6 387 1 clock_nanosleep
0.04 0.001942 36 53 rt_sigprocmask
0.03 0.001274 2 566 31 openat
0.02 0.000925 1 860 getdents64
0.02 0.000900 3 272 mmap
0.01 0.000560 1 539 close
0.01 0.000317 7 45 munmap
0.01 0.000310 310 1 clone
0.01 0.000308 2 143 19 read
0.01 0.000259 3 82 mprotect
0.00 0.000153 153 1 shmdt
0.00 0.000142 10 13 madvise
0.00 0.000063 2 22 write
0.00 0.000028 1 18 ioctl
0.00 0.000028 2 13 clone3
0.00 0.000022 1 19 timerfd_settime
0.00 0.000014 14 1 lseek
0.00 0.000014 7 2 getrusage
0.00 0.000014 7 2 getpriority
0.00 0.000013 13 1 restart_syscall
0.00 0.000011 0 12 sched_setaffinity
0.00 0.000008 8 1 shmat
0.00 0.000008 8 1 setpriority
0.00 0.000007 7 1 shmget
0.00 0.000006 3 2 2 statfs
0.00 0.000006 6 1 pipe2
0.00 0.000006 0 14 rseq
0.00 0.000005 2 2 sysinfo
0.00 0.000005 0 15 set_robust_list
0.00 0.000005 5 1 timerfd_create
0.00 0.000004 2 2 1 shmctl
0.00 0.000004 1 3 getpid
0.00 0.000004 0 12 gettid
0.00 0.000004 1 3 sched_getaffinity
0.00 0.000003 0 9 brk
0.00 0.000003 0 4 rt_sigaction
0.00 0.000003 1 2 2 access
0.00 0.000002 2 1 fcntl
0.00 0.000001 0 2 1 arch_prctl
0.00 0.000001 1 1 set_tid_address
0.00 0.000001 1 1 prlimit64
0.00 0.000001 1 1 getrandom
0.00 0.000000 0 4 pread64
0.00 0.000000 0 1 execve
0.00 0.000000 0 4 fchdir
0.00 0.000000 0 1 setsid
0.00 0.000000 0 1 io_uring_setup
0.00 0.000000 0 1 io_uring_register
------ ----------- ----------- --------- --------- ------------------
100.00 4.566427 40 111515 454 total