A user-friendly approach to application-agnostic state synchronization
2023-08-03
userfaultfd and NBD, which challenges exist, which optimizations can be done and how such a universal API and related wire protocols can look like
open(), read(), write() and close()
ioctl
modprobe, rmmod etc.)[6]
sigaction()[8]
userfaultfd), which allows sharing a file descriptor between processes
mmap
inotify
dentry and inode caches
open, read, write etc.
curl) can be piped into another command (e.g. jq) to achieve a larger goal
userfaultfd
SIGSEGV signal handlers to handle this from the program
mmap and Hashing
mmap allows mapping a memory region to a file
mmaped regions still using caching to speed up reads
msync
mmaped regions
mmap a file into memory
mmap-based solution for pre- and/or post-copy migration
mmaping a file, a device is mmaped
Mount that consists of both a client and a server, both running on the local system
ReaderAt
ReadWriterAt
msync/flushes the drive, and returns the chunks that were changed between when it started tracking and when it finalized
msync the seeder’s app state, the RTT and, if they are being accessed immediately, how long it takes to fetch the chunks that were written in between starting to track and finalize
Finalize multiple times to restart it
Finalize needs to return a list of dirty chunks, it requires the VM or app on the source device to be suspended before Finalize can return
userfaultfd
SIGSEGV signal in the process
userfaultfd to do this in a more elegant way (available since kernel 4.11)
userfaultfd allows handling these page faults in userspace
mmap
userfaultfd API, we need to transfer this file descriptor to a process that should respond with the chunks of memory to be put into the faulting address
UFFDIO_COPY ioctl and a pointer to the chunk of memory that should be used on the file descriptor (code snippet from https://github.com/loopholelabs/userfaultfd-go/blob/master/pkg/mapper/handler.go)
userfault-go
unsafe
syscall and unix packages to interact with ioctl etc.
ioctl syscall to get a file descriptor to the userfaultfd API, and then register the API to handle any faults on the region (code snippet from https://github.com/loopholelabs/userfaultfd-go/blob/master/pkg/mapper/register.go#L15)
userfaultfd backends
userfaultfd and the pull method is that we are able to simplify the backend of the entire system down to an io.ReaderAt (code snippet from https://pkg.go.dev/io#ReaderAt)
io.ReaderAt as a backend for a userfaultfd-go registered object
io.ReaderAt, we can also use a file as the backend directly, creating a system that essentially allows for mounting a (remote) file into memory (code snippet from https://github.com/loopholelabs/userfaultfd-go/blob/master/cmd/userfaultfd-go-example-file/main.go)
mmap, which allows us to map a file into memory
mmap doesn’t write changes from memory back into the file, no matter if the file descriptor passed to it would allow it to or not
MAP_SHARED flag; this tells the kernel to write back changes to the memory region to the corresponding regions of the backing file
userfaultfd
synced in order for them to be written to disks, mmaped regions need to be msynced in order to flush changes to the backing file
msync is critical
O_DIRECT to skip this kernel caching if your process already does caching on its own, but this flag is ignored by the mmap
inotify vs. polling
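Polling, the alternative to inotify that this section discusses, can be sketched as hashing fixed-size chunks of a region and comparing the result with the hashes from the previous poll. The following is a minimal sketch under stated assumptions: chunkHashes and changedChunks are hypothetical names, CRC32 is used only because it ships with Go's standard library, and the per-chunk goroutines merely mirror the idea of calculating hashes concurrently rather than reproducing any specific implementation.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sync"
)

// chunkHashes (hypothetical name) hashes every fixed-size chunk of data
// concurrently and returns one CRC32 checksum per chunk.
func chunkHashes(data []byte, chunkSize int) []uint32 {
	n := (len(data) + chunkSize - 1) / chunkSize
	hashes := make([]uint32, n)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()

			start := i * chunkSize
			end := start + chunkSize
			if end > len(data) {
				end = len(data)
			}

			// Each goroutine writes to its own slot, so no lock is needed.
			hashes[i] = crc32.ChecksumIEEE(data[start:end])
		}(i)
	}
	wg.Wait()

	return hashes
}

// changedChunks (hypothetical name) returns the indexes of all chunks whose
// hash differs between two polls.
func changedChunks(prev, curr []uint32) []int {
	changed := []int{}
	for i := range curr {
		if i >= len(prev) || prev[i] != curr[i] {
			changed = append(changed, i)
		}
	}

	return changed
}

func main() {
	region := []byte("aaaabbbbccccdddd")

	before := chunkHashes(region, 4)
	region[5] = 'X' // simulate a write to the second chunk between two polls
	after := chunkHashes(region, 4)

	fmt.Println(changedChunks(before, after)) // [1]
}
```

Only the chunks reported as changed would then need to be transferred, at the cost of re-hashing the whole region on every poll.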
inotify to watch changes to a file
inotify allows applications to register handlers on a file’s events, e.g. WRITE or SYNC. This allows for efficient file synchronization, and is used by many file synchronization tools
mmaped files though, so we can’t use it
rsync, and to support a central forwarding hub instead of requiring P2P connectivity between each host
syncerID
src-control peer type (code snippet from https://github.com/pojntfx/networked-linux-memsync/blob/159d4af/cmd/darkmagyk-cloudpoint/main.go#L824-L844)
dst-control peer type (code snippet from
https://github.com/pojntfx/networked-linux-memsync/blob/159d4af/cmd/darkmagyk-cloudpoint/main.go#L845-L880)
src-control peers
dst-control peers so that it can subscribe
dst peer type (code snippet from https://github.com/pojntfx/networked-linux-memsync/blob/159d4af/cmd/darkmagyk-cloudpoint/main.go#L882-L942)
src-control peer
src-control peer
src-control peer has connected to the hub with this ID with a new src-data peer by listening for src-data peer ID broadcasts
dst peer, relaying any information between the two
src-data peer type (code snippet from https://github.com/pojntfx/networked-linux-memsync/blob/159d4af/cmd/darkmagyk-cloudpoint/main.go#L943-L954)
dst peer to continue
src-control type of peer
dst peer types
dst peer types send an ID
src-data peer
dst peer
dst peer through the multiplexer
dst-control type of peer
dst peer
dst peer
GetHashesForBlocks handles the concurrent calculation of the hashes for a file on both the file transmitter and receiver
afero.Fs
afero.Fs, we can separate the FUSE implementation from the actual file system structure, making it unit testable and making it possible to add caching in user space (code snippet from https://github.com/pojntfx/stfs/blob/main/pkg/fs/file.go)
afero.Fs to a FUSE backend, so it would be possible to switch between different file system backends without having to write FUSE-specific code (code snippet from https://github.com/JakWai01/sile-fystem/blob/main/pkg/filesystem/fs.go)
go-nbd
go-nbd is very simple and only requires four methods: ReadAt, WriteAt, Size and Sync
userfaultfd-go is that they can also handle writes
go-nbd exposes a Handle function to support multiple users without depending on a specific transport layer (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go)
accept syscall can still be used easily
binary package (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go#L73-L76)
NEGOTIATION_ID_OPTION_INFO and NEGOTIATION_ID_OPTION_GO exchange the information about the chosen export (i.e. block size, export size, export name and description etc.), and if GO is specified it continues directly to the transmission phase
NEGOTIATION_ID_OPTION_LIST encodes and sends the list of exports to the client
NEGOTIATION_ID_OPTION_ABORT aborts the handshake,
causing the server to close the connection
TRANSMISSION_TYPE_REQUEST_READ executes a read request on the backend and sends the relevant chunk to the client (https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go#L331-L351)
TRANSMISSION_TYPE_REQUEST_WRITE reads an offset and chunk from the client, then writes it to the backend; if the read-only option is specified for the server, a permission error is returned to the client instead (https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go#L353-L368)
TRANSMISSION_TYPE_REQUEST_DISC gracefully disconnects from the server and causes the backend to sync, i.e. to flush its changes to the disk
nbds_max parameter
sysfs (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/utils/unused.go)
Connect and a connection and options are provided
ioctl numbers depend on the kernel and are extracted using CGo (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/ioctl/negotiation_cgo.go)
NEGOTIATION_TYPE_REPLY_INFO, the client handles the export size with NEGOTIATION_TYPE_INFO_EXPORT, the name, description and block size with NEGOTIATION_TYPE_INFO_BLOCKSIZE
ioctls (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L290-L328)
go-nbd can also list exports with
List
NEGOTIATION_ID_OPTION_LIST is provided
DO_IT ioctl never returns until it is disconnected, meaning that an external system must be used to detect whether the device is actually ready
sysfs for the size parameter, or by using udev
udev manages devices in Linux
udev event, which we can subscribe to and use as a reliable and idiomatic way of waiting for the ready state (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L104C10-L138)
sysfs directly can be faster than subscribing to the udev event, so we give the user the option to switch between both options (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L140-L178)
ioctls: TRANSMISSION_IOCTL_CLEAR_QUE, TRANSMISSION_IOCTL_DISCONNECT and TRANSMISSION_IOCTL_CLEAR_SOCK (https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L334-L359)
ioctls causes DO_IT to exit,
thus terminating the call to Connect
opening the block device that the client has connected to, usually the kernel does provide a caching mechanism and thus requires sync to flush changes
O_DIRECT however, it is possible to skip the kernel caching layer and write all changes directly to the NBD client/server
syncing should be as small as possible
read/write syscalls, like a local file would be accessed
Open and Close for the device with both mounts, preventing potential synchronization pitfalls over just manually implementing it in each application
mmaps the NBD device directly
mmap/slice approach has a few benefits
make, except it’s transparently mapped to the (remote) backend
mmap/the byte slices also swaps out the syscall-based file interface with a truly random access one, which allows for faster concurrent reads from the underlying backend
mmap one or multiple files on the mounted file system instead of mmaping the block device directly
ReadWriterAt pipeline
ReadWriterAt, combining an io.ReaderAt and an io.WriterAt
Size and Sync syscalls directly to the underlying backend, but wrap a backend’s ReadAt and WriteAt methods in a pipeline of other ReadWriterAts
ReadWriterAt is the ArbitraryReadWriterAt (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/chunks/arbitrary_rwat.go)
ReadAt, it calculates the index of the chunk that the offset falls into and the position within the offsets
ChunkedReadWriterAt, which ensures that the limits concerning the maximum size supported by the backend and the actual chunks are being respected
Puller component asynchronously pulls chunks in the background (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/chunks/puller.go)
ReaderAt, which is then expected to handle the actual copying on its own
SyncedReadWriterAt instead
ReadWriterAt
ReadAt, it is tracked and marked as remote by adding it to
a local map
ReadWriterAt, and is then marked as locally available, so that on the second read it is fetched locally directly
Puller, this also means that a chunk which hasn’t been fetched asynchronously yet will be scheduled to be pulled immediately
SyncedReadWriterAt and the Puller component implements the pull post-copy system in a modular and testable way
Puller interface, it is possible to implement a read-only managed mount
rr+ prefetching mechanism from “Remote Regions” (reference atc18-aguilera)
Sync in a set recurring interval (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/chunks/pusher.go)
ReadWriter, and copies it to the remote
MarkOffsetPushable method (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/mount/path_managed.go#L171C24-L185)
ReadWriterAt pipeline
ReadAt and WriteAt
ReadAt is a simple proxy, while WriteAt also marks a chunk as pushable (since it mutates data) before writing to the local ReadWriterAt
Pusher step is simply skipped (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/mount/path_managed.go#L142-L169)
ArbitraryReadWriterAt is used (graphic of the four systems and how they are connected to each other vs. how the direct mounts work)
ReadWriterAts and a simple table-driven test can be created (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/chunks/puller_test.go)
ReadWriterAt before the NBD device is open
Puller is simply skipped (code snippet from
https://github.com/pojntfx/r3map/blob/main/pkg/mount/path_managed.go#L187-L222)
mmap interfaces, the managed mount builds on this interface in order to provide the same interfaces
Sync() API, e.g. the msync on the mmaped file must happen before Sync() is called on the syncer
Closeing the mount
ReadWriterAt components allow the reuse of lots of code for both the mount API and the migration API
ReadAt methods, but also new APIs such as returning dirty chunks from Sync and adding a Track method (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/services/seeder.go#L15-L21)
TrackingReadWriterAt that is connected to the seeder’s ReadWriterAt pipeline
Track, the tracker intercepts all WriteAt calls and adds them to a local de-duplicated store (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/chunks/tracking_rwat.go#L28-L40)
Sync is called, the changed chunks are returned and the de-duplicated store is cleared
service utility struct from Open (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/services/seeder.go)
LockableReadWriterAt, into its internal pipeline (add graphic of the internal pipeline)
Finalize has been called (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/chunks/lockable_rwat.go#L19-L37)
Finalize did not yet mark the changed chunks) could have poisoned the cache on the e.g. mmaped device
Finalize can be called
Finalize then calls Sync() on the remote,
marks the changed chunks as remote, and schedules them to be pulled in the background (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/migration/path_leecher.go#L257-L280)
ReadWriterAt to make accessing the path/file/slice too early harder, only Finalize returns the managed object, so that the happy path can less easily lead to deadlocks
Close on the seeder and disconnects the leecher from the seeder, causing both to shut down (code snippet from https://github.com/pojntfx/r3map/blob/main/cmd/r3map-migration-benchmark-server/main.go#L137)
ReadAt requests’ latency is an important metric since it strongly correlates with the total latency
ReadAt and WriteAt are implemented using Cassandra’s query language (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/backend/cassandra.go#L36-L63)
dialable
dialable from the source
ReadAt, is called, it is looked up via reflection and validated (code snippet from https://github.com/pojntfx/dudirekta/blob/main/pkg/rpc/registry.go#L323-L357)
go-nbd backend interface is implemented for the remote representation, creating a universally reusable, generic RPC backend wrapper (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/backend/rpc.go)
dialing the server would mean that the server could not reference the multiple client connections as one composite client without changes to the protocol
proto3 DSL
ReadAt RPCs doesn’t require any encoding (whereas JSON, used by dudirekta, base64 encodes these chunks)
userfaultfd
msync
userfaultfd we are able to map almost any object into memory
userfaultfd API socket is also synchronous, so each chunk needs to be sent one after the other, meaning that it is very vulnerable to long RTT values
userfaultfd, this system also has limitations
userfaultfd was only able to catch reads, this system is only able to catch writes to the file
Open() time and thus reduce the overhead of mounting a remote resource
ublk as an alternative
ublk could also be used
ublk uses io_uring, which means that it could potentially allow for much faster concurrent access
ublk driver that creates /dev/ublkb* devices
ublk is able to use io_uring pass-through commands
io_uring architecture promises lower latency and better throughput
mmap API is used, it is possible that the GC tries to manage the underlying slice, or tries to release memory as data is being copied from the mount
mmaped region into memory, but this will also cause all chunks to be fetched, which leads to a high Open() latency
mmaping
ram-dl
ram-ul “uploads” RAM by exposing a memory, file or directory-backed file over fRPC
ram-dl then does all of the above
tapisk
bbolt DB is used as an index
MTSEEK ioctl to fast-forward to a record on the tape (code snippet from https://github.com/pojntfx/tapisk/blob/main/pkg/mtio/tape.go#L25-L40)
tapisk is a unique use case that shows the versatility of the approach chosen and how flexible it is
ReadWriterAt, in the same way as the directory backend
tar
s3-fuse
s3-fuse) when they are communicated to the FUSE by the kernel, which makes writes very slow
inotify support, missing permissions etc.)
>0
mmaping it directly to fetch it
[]byte
[]byte
from the block device, we can use the managed mount/direct mount/migration APIs to send/receive them or mount them
[]byte, which by definition every state is (def. a process level or VM level etc.)
userfaultfd is an interesting API and very idiomatic to both Linux as a first-party solution, and Go due to its fairly low implementation overhead, but does fall short in throughput, especially when used in WAN networks, where other options provide better performance
mmaped files does provide a simple way of memory synchronization for specific scenarios, but does have a very significant I/O and compute overhead making it
unsuitable for most applications
ublk, could further improve on the implementation presented

API: Application Programming Interface
I/O: Input/Output
IO: Input/Output
OS: Operating System
CPU: Central Processing Unit
RAM: Random Access Memory
SSD: Solid State Drive
HDD: Hard Disk Drive
UUID: Universally Unique Identifier
CRC32: Cyclic Redundancy Check 32-Bit
LRU: Least Recently Used
WAN: Wide Area Network
LAN: Local Area Network
TCP: Transmission Control Protocol
UDP: User Datagram Protocol
P2P: Peer-To-Peer
NATs: Network Address Translators
IPC: Inter-Process Communication
RTT: Round-Trip Time
SRP: SCSI RDMA Protocol
GNU: GNU’s Not Unix
UNIX: UNIX Family of Operating Systems
macOS: Apple Macintosh Operating System
FreeBSD: Free Berkeley Software Distribution
NBD: Network Block Device
S3fs: S3 File System
NVMe: Non-Volatile Memory Express
LTFS: Linear Tape File System
LTO: Linear Tape-Open
EXT4: Fourth Extended Filesystem
Btrfs: B-Tree File System
C: C Programming Language
Rust: Rust Programming Language
Go: Go Programming Language
C++: C++ Programming Language
ARM: ARM RISC Computer Processor Architecture
x86: x86 CISC Computer Processor Architecture
RISC-V: RISC-V RISC Computer Processor Architecture
LPDDR5: Low-Power Double Data Rate 5
HTTP: Hypertext Transfer Protocol
HTTPS: HTTP Secure
HTTP/2: HTTP Version 2
QUIC: Quick UDP Internet Connections
WebRTC: Web Real-Time Communication
Wasm: WebAssembly
IETF: Internet Engineering Task Force
OIDC: OpenID Connect
AWS: Amazon Web Services
CNCF: Cloud Native Computing Foundation
S3: Simple Storage Service
TLS: Transport Layer Security
mTLS: Mutual TLS
SSH: Secure Shell
DoS: Denial of Service
JSON: JavaScript Object Notation
JSONL: JSON Lines
SQL: Structured Query Language
NoSQL: Not Only SQL
Protobuf: Protocol Buffers
IDL: Interface Definition Language
DSL: Domain-Specific Language
KV: Key-Value
Syscalls: System Calls
VM: Virtual Machine
RPC: Remote Procedure Call
REST: Representational State Transfer
FUSE: File Systems in Userspace