A user-friendly approach to application-agnostic state synchronization
2023-08-03
userfaultfd
and NBD, which challenges exist, which
optimizations can be done and how such a universal API and related wire
protocols can look like
open()
, read()
,
write()
and close()
ioctl
modprobe
, rmmod
etc.)[6]sigaction()
[8]userfaultfd
), which allows sharing a file descriptor
between processesmmap
inotify
dentry
and
inode
cachesopen
, read
, write
etc.curl
) can be piped into another command
(e.g. jq
) to achieve a larger goaluserfaultfd
SIGSEGV
signal
handlers to handle this from the program
mmap
and Hashingmmap
allows mapping a memory region
to a filemmap
ed regions still using caching to speed up
readsmsync
mmap
ed regionsmmap
a
file into memorymmap
-based solution for pre- and/or post-copy
migrationmmap
ing a file, a device is
mmap
edMount
that consists of both
a client and a server, both running on the local systemReaderAt
ReadWriterAt
msync
/flushes the drive, and
returns the chunks that were changed between it started tracking and
finalizingmsync
the seeder’s app state, the RTT and, if they
are being accessed immediately, how long it takes to fetch the chunks
that were written in between starting to track and finalizeFinalize
multiple times to restart itFinalize
needs to return a list of dirty
chunks, it requires the VM or app on the source device to be suspended
before Finalize
can returnuserfaultfd
SIGSEGV
signal in the processuserfaultfd
to do this in a more elegant way (available
since kernel 4.11)userfaultfd
allows handling these page faults in
userspacemmap
userfaultfd
API, we need to transfer this file descriptor to a process that should
respond with the chunks of memory to be put into the faulting
addressUFFDIO_COPY
ioctl
and a pointer to
the chunk of memory that should be used on the file descriptor (code
snippet from
https://github.com/loopholelabs/userfaultfd-go/blob/master/pkg/mapper/handler.go)
userfaultfd-go
unsafe
syscall
and unix
packages
to interact with ioctl
etc.ioctl
syscall to get a file descriptor
to the userfaultfd
API, and then register the API to handle
any faults on the region (code snippet from
https://github.com/loopholelabs/userfaultfd-go/blob/master/pkg/mapper/register.go#L15)userfaultfd
backends
userfaultfd
and the pull method
is that we are able to simplify the backend of the entire system down to
an io.ReaderAt
(code snippet from
https://pkg.go.dev/io#ReaderAt)io.ReaderAt
as a
backend for a userfaultfd-go
registered objectio.ReaderAt
, we can also use a
file as the backend directly, creating a system that essentially allows
for mounting a (remote) file into memory (code snippet from
https://github.com/loopholelabs/userfaultfd-go/blob/master/cmd/userfaultfd-go-example-file/main.go)mmap
, which allows us to map a
file into memorymmap
doesn’t write changes to the memory region back
into the file, regardless of whether the file descriptor passed to it
would allow it to
MAP_SHARED
flag; this tells the
kernel to write back changes to the memory region to the corresponding
regions of the backing fileuserfaultfd
sync
ed in order for them to be written to disks,
mmap
ed regions need to be msync
ed in order to
flush changes to the backing filemsync
is
criticalO_DIRECT
to skip this kernel
caching if your process already does caching on its own, but this flag
is ignored by the mmap
inotify
vs. polling
inotify
to watch changes to a
fileinotify
allows applications to register handlers on a
file’s events, e.g. WRITE
or SYNC
. This allows
for efficient file synchronization, and is used by many file
synchronization toolsmmap
ed files though, so we can’t use itrsync
, and to support a
central forwarding hub instead of requiring P2P connectivity between each
hostsyncerID
src-control
peer type (code snippet from
https://github.com/pojntfx/networked-linux-memsync/blob/159d4af/cmd/darkmagyk-cloudpoint/main.go#L824-L844)
dst-control
peer type (code snippet from
https://github.com/pojntfx/networked-linux-memsync/blob/159d4af/cmd/darkmagyk-cloudpoint/main.go#L845-L880)
src-control
peersdst-control
peers
so that it can subscribedst
peer type (code snippet from
https://github.com/pojntfx/networked-linux-memsync/blob/159d4af/cmd/darkmagyk-cloudpoint/main.go#L882-L942)
src-control
peersrc-control
peersrc-control
peer has connected to the
hub with this ID with a new src-data
peer by listening for
src-data
peer ID broadcastsdst
peer, relaying any information between the twosrc-data
peer type (code snippet from
https://github.com/pojntfx/networked-linux-memsync/blob/159d4af/cmd/darkmagyk-cloudpoint/main.go#L943-L954)
dst
peer to
continuesrc-control
type of peerdst
peer typesdst
peer types send an ID
src-data
peerdst
peerdst
peer through the multiplexerdst-control
type of peerdst
peerdst
peerGetHashesForBlocks
handles the concurrent calculation
of the hashes for a file on both the file transmitter and receiverafero.Fs
afero.Fs
,
we can separate the FUSE implementation from the actual file system
structure, making it unit testable and making it possible to add caching
in user space (code snippet from
https://github.com/pojntfx/stfs/blob/main/pkg/fs/file.go)afero.Fs
to a FUSE backend,
so it would be possible to switch between different file system backends
without having to write FUSE-specific (code snippet from
https://github.com/JakWai01/sile-fystem/blob/main/pkg/filesystem/fs.go)go-nbd
go-nbd
is very simple and
only requires four methods: ReadAt
, WriteAt
,
Size
and Sync
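
The four methods named above can be illustrated with a minimal in-memory backend. This is a sketch under assumptions: the `Backend` interface, the `MemoryBackend` type and the exact method signatures are hypothetical and modeled on Go's `io.ReaderAt` and `io.WriterAt`; they are not taken from the go-nbd source.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// Backend is a hypothetical interface built from the four methods named in
// the text; the exact signatures are an assumption of this sketch.
type Backend interface {
	ReadAt(p []byte, off int64) (n int, err error)
	WriteAt(p []byte, off int64) (n int, err error)
	Size() (int64, error)
	Sync() error
}

// MemoryBackend keeps the whole export in a fixed-size byte slice.
type MemoryBackend struct {
	mu   sync.RWMutex
	data []byte
}

// Compile-time check that MemoryBackend satisfies the sketched interface.
var _ Backend = (*MemoryBackend)(nil)

func NewMemoryBackend(size int) *MemoryBackend {
	return &MemoryBackend{data: make([]byte, size)}
}

func (b *MemoryBackend) ReadAt(p []byte, off int64) (int, error) {
	b.mu.RLock()
	defer b.mu.RUnlock()

	if off < 0 || off >= int64(len(b.data)) {
		return 0, io.EOF
	}

	// Copy as much as is available; a short read is reported as io.EOF.
	n := copy(p, b.data[off:])
	if n < len(p) {
		return n, io.EOF
	}

	return n, nil
}

func (b *MemoryBackend) WriteAt(p []byte, off int64) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()

	if off < 0 || off >= int64(len(b.data)) {
		return 0, io.ErrShortWrite
	}

	// The backend has a fixed size, so writes past the end are truncated.
	n := copy(b.data[off:], p)
	if n < len(p) {
		return n, io.ErrShortWrite
	}

	return n, nil
}

// Size reports the fixed size of the export in bytes.
func (b *MemoryBackend) Size() (int64, error) {
	b.mu.RLock()
	defer b.mu.RUnlock()

	return int64(len(b.data)), nil
}

// Sync is a no-op because there is no persistent storage to flush to.
func (b *MemoryBackend) Sync() error {
	return nil
}

func main() {
	b := NewMemoryBackend(16)

	if _, err := b.WriteAt([]byte("state"), 2); err != nil {
		panic(err)
	}

	buf := make([]byte, 5)
	if _, err := b.ReadAt(buf, 2); err != nil {
		panic(err)
	}

	fmt.Printf("%s\n", buf) // prints "state"
}
```

A file-backed or network-backed variant would keep the same four methods and only change where the bytes are stored, which is what keeps the surface small.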
userfaultfd-go
is that they can also handle writesgo-nbd
exposes a Handle
function to
support multiple users without depending on a specific transport layer
(code snippet from
https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go)accept
syscall can still be
used easilybinary
package (code snippet from
https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go#L73-L76)NEGOTIATION_ID_OPTION_INFO
and
NEGOTIATION_ID_OPTION_GO
exchange the information about the
chosen export (i.e. block size, export size, export name and description
etc.), and if GO
is specified it continues directly to the
transmission phaseNEGOTIATION_ID_OPTION_LIST
encodes and sends the list
of exports to the clientNEGOTIATION_ID_OPTION_ABORT
aborts the handshake,
causing the server to close the connectionTRANSMISSION_TYPE_REQUEST_READ
executes a read request
on the backend and sends the relevant chunk to the client
(https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go#L331-L351)TRANSMISSION_TYPE_REQUEST_WRITE
reads an offset and
chunk from the client, then writes it to the backend; if the read-only
option is specified for the server, a permission error is returned to
the client instead
(https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go#L353-L368)TRANSMISSION_TYPE_REQUEST_DISC
gracefully disconnects
from the server and causes the backend to sync, i.e. to flush its
changes to the disknbds_max
parametersysfs
(code snippet from
https://github.com/pojntfx/r3map/blob/main/pkg/utils/unused.go)Connect
and a connection
and options are providedioctl
numbers depend on the kernel and are
extracted using CGo
(code snippet from
https://github.com/pojntfx/go-nbd/blob/main/pkg/ioctl/negotiation_cgo.go)NEGOTIATION_TYPE_REPLY_INFO
, the client handles
the export size with NEGOTIATION_TYPE_INFO_EXPORT
, the
name, description and block size with
NEGOTIATION_TYPE_INFO_BLOCKSIZE
ioctl
s (code snippet from
https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L290-L328)go-nbd
can also list exports with
List
NEGOTIATION_ID_OPTION_LIST
is providedDO_IT
syscall never returns until it is
disconnected, meaning that an external system must be used to detect
whether the device is actually readysysfs
for the size parameter, or by using
udev
udev
manages devices in Linuxudev
event, which we can subscribe to and use as a reliable
and idiomatic way of waiting for the ready state (code snippet from
https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L104C10-L138)sysfs
directly can be
faster than subscribing to the udev
event, so we give the
user the option to switch between both options (code snippet from
https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L140-L178)ioctl
s:
TRANSMISSION_IOCTL_CLEAR_QUE
,
TRANSMISSION_IOCTL_DISCONNECT
and
TRANSMISSION_IOCTL_CLEAR_SOCK
(https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L334-L359)ioctl
s causes DO_IT
to exit,
thus terminating the call to Connect
open
ing the block device that the client has
connected to, usually the kernel does provide a caching mechanism and
thus requires sync
to flush changesO_DIRECT
however, it is
possible to skip the kernel caching layer and write all changes directly
to the NBD client/serversync
ing
should be as small as possibleread
/write
syscalls, like a local file
would be accessedOpen
and
Close
for the device with both mounts, preventing potential
synchronization pitfalls over just manually implementing it in each
applicationmmap
s the NBD device directlymmap
/slice approach has a few benefitsmake
, except it's
transparently mapped to the (remote) backendmmap
/the byte slices also swaps out the syscall-based
file interface with a truly random access one, which allows for faster
concurrent reads from the underlying backendmmap
one or multiple files
on the mounted file system instead of mmap
ing the block
device directlyReadWriterAt
pipeline
ReadWriterAt
, combining an io.ReaderAt
and an
io.WriterAt
Size
and Sync
syscalls directly to the underlying backend, but wrap a backend’s
ReadAt
and WriteAt
methods in a pipeline of
other ReadWriterAt
sReadWriterAt
is the
ArbitraryReadWriterAt
(code snippet from
https://github.com/pojntfx/r3map/blob/main/pkg/chunks/arbitrary_rwat.go)ReadAt
, it calculates the index of the chunk that
the offset falls into and the position within the offsetsChunkedReadWriterAt
, which ensures that the limits
concerning the maximum size supported by the backend and the actual
chunks are being respectedPuller
component asynchronously pulls chunks in the
background (code snippet from
https://github.com/pojntfx/r3map/blob/main/pkg/chunks/puller.go)ReaderAt
, which is
then expected to handle the actual copying on its ownSyncedReadWriterAt
insteadReadWriterAt
ReadAt
, it is tracked and marked as remote by adding it to
a local mapReadWriterAt
, and is then marked as locally
available, so that on the second read it is fetched locally
directlyPuller
,
this also means that a chunk which hasn’t been fetched asynchronously
yet will be scheduled to be pulled immediately
SyncedReadWriterAt
and the
Puller
component implements the pull post-copy system in a
modular and testable wayPuller
interface, it is possible to
implement a read-only managed mountrr+
prefetching mechanism from
“Remote Regions” (reference atc18-aguilera)Sync
in a set recurring interval (code snippet
from
https://github.com/pojntfx/r3map/blob/main/pkg/chunks/pusher.go)ReadWriter
, and copies it to the remoteMarkOffsetPushable
method (code snippet from
https://github.com/pojntfx/r3map/blob/main/pkg/mount/path_managed.go#L171C24-L185)ReadWriterAt
pipelineReadAt
and
WriteAt
ReadAt
is a simple proxy, while WriteAt
also marks a chunk as pushable (since it mutates data) before writing to
the local ReadWriterAt
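
The proxy behaviour described above (reads pass through, writes mark a chunk as pushable first) could look roughly like the following. This is a sketch under assumptions: `MarkingReadWriterAt`, `sliceRWAt` and `PushableOffsets` are hypothetical names, and marking every chunk that a write overlaps is a choice made for this example, not a statement about how r3map tracks chunks.

```go
package main

import (
	"fmt"
	"io"
	"sort"
	"sync"
)

type ReadWriterAt interface {
	io.ReaderAt
	io.WriterAt
}

// sliceRWAt is a tiny in-memory stand-in for the local ReadWriterAt.
type sliceRWAt []byte

func (s sliceRWAt) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 || off >= int64(len(s)) {
		return 0, io.EOF
	}
	n := copy(p, s[off:])
	if n < len(p) {
		return n, io.EOF
	}
	return n, nil
}

func (s sliceRWAt) WriteAt(p []byte, off int64) (int, error) {
	if off < 0 || off >= int64(len(s)) {
		return 0, io.ErrShortWrite
	}
	n := copy(s[off:], p)
	if n < len(p) {
		return n, io.ErrShortWrite
	}
	return n, nil
}

// MarkingReadWriterAt (hypothetical) proxies reads and records which chunks
// were written so that they can be pushed later.
type MarkingReadWriterAt struct {
	local     ReadWriterAt
	chunkSize int64

	mu       sync.Mutex
	pushable map[int64]struct{} // keyed by chunk-aligned offset
}

func NewMarkingReadWriterAt(local ReadWriterAt, chunkSize int64) *MarkingReadWriterAt {
	return &MarkingReadWriterAt{
		local:     local,
		chunkSize: chunkSize,
		pushable:  map[int64]struct{}{},
	}
}

// ReadAt is a plain proxy; reads never change what needs to be pushed.
func (m *MarkingReadWriterAt) ReadAt(p []byte, off int64) (int, error) {
	return m.local.ReadAt(p, off)
}

// WriteAt marks every chunk the write overlaps before forwarding it.
func (m *MarkingReadWriterAt) WriteAt(p []byte, off int64) (int, error) {
	if off >= 0 && len(p) > 0 {
		first := off / m.chunkSize
		last := (off + int64(len(p)) - 1) / m.chunkSize

		m.mu.Lock()
		for c := first; c <= last; c++ {
			m.pushable[c*m.chunkSize] = struct{}{}
		}
		m.mu.Unlock()
	}

	return m.local.WriteAt(p, off)
}

// PushableOffsets returns the marked chunk offsets in ascending order.
func (m *MarkingReadWriterAt) PushableOffsets() []int64 {
	m.mu.Lock()
	defer m.mu.Unlock()

	offsets := make([]int64, 0, len(m.pushable))
	for off := range m.pushable {
		offsets = append(offsets, off)
	}
	sort.Slice(offsets, func(i, j int) bool { return offsets[i] < offsets[j] })

	return offsets
}

func main() {
	m := NewMarkingReadWriterAt(make(sliceRWAt, 16), 4)

	// A 3-byte write at offset 3 overlaps the chunks starting at 0 and 4.
	if _, err := m.WriteAt([]byte("abc"), 3); err != nil {
		panic(err)
	}

	fmt.Println(m.PushableOffsets()) // prints "[0 4]"
}
```

Marking before the write is forwarded means a concurrent push can never miss a chunk that has already been mutated; the cost is that a failed write may leave a chunk marked unnecessarily.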
Pusher
step is simply
skipped (code snippet from
https://github.com/pojntfx/r3map/blob/main/pkg/mount/path_managed.go#L142-L169)ArbitraryReadWriter
is used (graphic of the four systems
and how they are connected to each other vs. how the direct mounts
work)ReaderWriterAt
s and a simple table-driven test can
be created (code snippet from
https://github.com/pojntfx/r3map/blob/main/pkg/chunks/puller_test.go)ReadWriterAt
before the NBD device is openPuller
is simply skipped (code snippet from
https://github.com/pojntfx/r3map/blob/main/pkg/mount/path_managed.go#L187-L222)mmap
interfaces, the managed mount
builds on this interface in order to provide the same interfacesSync()
API,
e.g. the msync
on the mmap
ed file must happen
before Sync()
is called on the syncerClose
ing the mountReadWriterAt
components allow the reuse of lots of code for both the mount API and
the migration APIReadAt
methods, but also new APIs such as returning dirty
chunks from Sync
and adding a Track
method
(code snippet from
https://github.com/pojntfx/r3map/blob/main/pkg/services/seeder.go#L15-L21)TrackingReadWriterAt
that is connected to the seeder’s
ReadWriterAt
pipelineTrack
, the tracker intercepts all
WriteAt
calls and adds them to a local de-duplicated store
(code snippet from
https://github.com/pojntfx/r3map/blob/main/pkg/chunks/tracking_rwat.go#L28-L40)Sync
is called, the changed chunks are returned
and the de-duplicated store is clearedservice
utility struct from Open
(code snippet from
https://github.com/pojntfx/r3map/blob/main/pkg/services/seeder.go)LockableReadWriterAt
, into its internal pipeline (add
graphic of the internal pipeline)Finalize
has been called (code snippet
from
https://github.com/pojntfx/r3map/blob/main/pkg/chunks/lockable_rwat.go#L19-L37)Finalize
did not yet mark the changed chunks) could have
poisoned the cache on the e.g. mmap
ed deviceFinalize
can be calledFinalize
then calls Sync()
on the remote,
marks the changed chunks as remote, and schedules them to also be pulled in
the background (code snippet from
https://github.com/pojntfx/r3map/blob/main/pkg/migration/path_leecher.go#L257-L280)ReadWriterAt
to make accessing the path/file/slice too
early harder, only Finalize
returns the managed object, so
that the happy path can less easily lead to deadlocksClose
on the seeder and disconnects the leecher from
the seeder, causing both to shut down (code snippet from
https://github.com/pojntfx/r3map/blob/main/cmd/r3map-migration-benchmark-server/main.go#L137)ReadAt
requests’ latency is an important metric
since it strongly correlates with the total latencyReadAt
and WriteAt
are implemented using
Cassandra’s query language (code snippet from
https://github.com/pojntfx/r3map/blob/main/pkg/backend/cassandra.go#L36-L63)dial
able
dial
able from the
sourceReadAt
, is called, it is looked up
via reflection and validated (code snippet from
https://github.com/pojntfx/dudirekta/blob/main/pkg/rpc/registry.go#L323-L357)go-nbd
backend
interface is implemented for the remote representation, creating a
universally reusable, generic RPC backend wrapper (code snippet from
https://github.com/pojntfx/r3map/blob/main/pkg/backend/rpc.go)dial
ing the server would mean that the server could not
reference the multiple client connections as one composite client
without changes to the protocolproto3
DSLReadAt
RPCs doesn’t require any encoding (whereas JSON, used
by dudirekta, base64
encodes these chunks)userfaultfd
msync
userfaultfd
we are able to map almost any
object into memoryuserfaultfd
API socket is also synchronous, so each
chunk needs to be sent one after the other, meaning that it is very
vulnerable to long RTT valuesuserfaultfd
, this system also has
limitationsuserfaultfd
was only able to catch reads, this
system is only able to catch writes to the fileOpen()
time and thus reduce the overhead of mounting a
remote resource
ublk
as an alternative
ublk
could also be usedublk
uses io_uring
, which means that it
could potentially allow for much faster concurrent accessublk
driver that
creates /dev/ublkb*
devicesublk
is able
to use io_uring
pass-through commandsio_uring
architecture promises lower latency and
better throughputmmap
API is used, it is possible that the GC
tries to manage the underlying slice, or tries to release memory as data
is being copied from the mountmmap
ed region into
memory, but this will also cause all chunks to be fetched, which leads
to a high Open()
latencymmap
ingram-dl
ram-ul
“uploads” RAM by exposing a memory, file or
directory-backed file over fRPCram-dl
then does all of the abovetapisk
bbolt
DB is used as an indexMTSEEK
ioctl to
fast-forward to a record on the tape (code snippet from
https://github.com/pojntfx/tapisk/blob/main/pkg/mtio/tape.go#L25-L40)tapisk
is a unique use case that shows the versatility
of the approach chosen and how flexible it isReadWriterAt
, in the same way as the directory backendtar
s3-fuse
s3-fuse
) when they
are communicated to the FUSE by the kernel, which makes writes very
slowinotify
support, missing permissions etc.)>0
mmap
ing it directly to fetch it[]byte
[]byte
from the block device, we can use the managed mount/direct
mount/migration APIs to send/receive them or mount them[]byte
, which by definition every
state is (def. a process level or VM level etc.)userfaultfd
is an interesting API and very idiomatic to
both Linux as a first-party solution, and Go due to its fairly low
implementation overhead, but does fall short in throughput, especially
when used over WANs, where other options provide better
performancemmap
ed files does
provide a simple way of memory synchronization for specific scenarios,
but does have a very significant I/O and compute overhead making it
unsuitable for most applicationsublk
, could further improve on the
implementation presented

API: Application Programming Interface
I/O: Input/Output
IO: Input/Output
OS: Operating System
CPU: Central Processing Unit
RAM: Random Access Memory
SSD: Solid State Drive
HDD: Hard Disk Drive
UUID: Universally Unique Identifier
CRC32: Cyclic Redundancy Check 32-Bit
LRU: Least Recently Used
WAN: Wide Area Network
LAN: Local Area Network
TCP: Transmission Control Protocol
UDP: User Datagram Protocol
P2P: Peer-To-Peer
NATs: Network Address Translators
IPC: Inter-Process Communication
RTT: Round-Trip Time
SRP: SCSI RDMA Protocol
GNU: GNU’s Not Unix
UNIX: UNIX Family of Operating Systems
macOS: Apple Macintosh Operating System
FreeBSD: Free Berkeley Software Distribution
NBD: Network Block Device
S3fs: S3 File System
NVMe: Non-Volatile Memory Express
LTFS: Linear Tape File System
LTO: Linear Tape-Open
EXT4: Fourth Extended Filesystem
Btrfs: B-Tree File System
C: C Programming Language
Rust: Rust Programming Language
Go: Go Programming Language
C++: C++ Programming Language
ARM: ARM RISC Computer Processor Architecture
x86: x86 CISC Computer Processor Architecture
RISC-V: RISC-V RISC Computer Processor Architecture
LPDDR5: Low-Power Double Data Rate 5
HTTP: Hypertext Transfer Protocol
HTTPS: HTTP Secure
HTTP/2: HTTP Version 2
QUIC: Quick UDP Internet Connections
WebRTC: Web Real-Time Communication
Wasm: WebAssembly
IETF: Internet Engineering Task Force
OIDC: OpenID Connect
AWS: Amazon Web Services
CNCF: Cloud Native Computing Foundation
S3: Simple Storage Service
TLS: Transport Layer Security
mTLS: Mutual TLS
SSH: Secure Shell
DoS: Denial of Service
JSON: JavaScript Object Notation
JSONL: JSON Lines
SQL: Structured Query Language
NoSQL: Not Only SQL
Protobuf: Protocol Buffers
IDL: Interface Definition Language
DSL: Domain-Specific Language
KV: Key-Value
Syscalls: System Calls
VM: Virtual Machine
RPC: Remote Procedure Call
REST: Representational State Transfer
FUSE: File Systems in Userspace