Efficient Synchronization of Linux Memory Regions over a Network: A Comparative Study and Implementation (Notes)
Felicitas Pojtinger (Stuttgart Media University)
2023-08-03
- 1 Introduction
- 2 Technology
- 2.1 User Space and Kernel Space
- 2.2 The Linux Kernel
- 2.3 Linux Kernel Modules
- 2.4 UNIX Signals and Handlers
- 2.5 UNIX Sockets
- 2.6 Memory Hierarchy
- 2.7 Memory Management in Linux
- 2.8 Swap Space
- 2.9 Page Faults
- 2.10 mmap
- 2.11 inotify
- 2.12 Linux Kernel Disk and File Caching
- 2.13 LAN and WAN
- 2.14 TCP, UDP, TLS and QUIC
- 2.15 Delta Synchronization
- 2.16 File Systems In Userspace (FUSE)
- 2.17 Network Block Device (NBD)
- 2.18 Virtual Machine Live Migration
- 2.18.1 Pre-Copy
- 2.18.2 Post-Copy
- 2.18.3 Workload Analysis
- 2.19 Streams and Pipelines
- 2.20 Go
- 2.21 gRPC
- 2.22 fRPC and Polyglot
- 2.23 Redis
- 2.24 S3 and Minio
- 2.25 Cassandra and ScyllaDB
- 3 Planning
- 3.1 Pull-Based Synchronization With userfaultfd
- 3.2 Push-Based Synchronization With mmap and Hashing
- 3.3 Push-Pull Synchronization with FUSE
- 3.4 Mounts with NBD
- 3.5 Push-Pull Synchronization with Mounts
- 3.6 Pull-Based Synchronization with Migrations
- 4 Implementation
- 4.1 Userfaults in Go with userfaultfd
- 4.2 File-Based Synchronization
- 4.3 FUSE Implementation in Go
- 4.4 NBD with go-nbd
- 4.5 Mounts
- 4.6 Live Migration
- 4.7 Pluggable Encryption and Authentication
- 4.8 Optimizing Backends For High RTT
- 4.9 Using Remote Stores as Backends
- 4.10 Bi-Directional and Concurrent RPCs with Dudirekta
- 4.11 Connection Pooling with gRPC
- 4.12 Optimizing RPC Throughput and Latency with fRPC
- 5 Results
- 5.1 Testing Environment
- 5.2 Access Methods
- 5.3 Initialization
- 5.4 Chunking
- 5.5 RPC frameworks
- 5.6 Backends
- 6 Discussion
- 6.1 Userfaults
- 6.2 File-Based Synchronization
- 6.3 FUSE
- 6.4 Direct Mounts
- 6.5 Managed Mounts
- 6.6 Chunking
- 6.7 RPC Frameworks
- 6.8 Backends
- 6.9 NBD
- 6.10 Language Limitations
- 6.11 Use Cases
- 6.11.1 Remote Swap With ram-dl
- 6.11.2 Mapping Tape Into Memory With tapisk
- 6.11.3 Improving File System Synchronization Solutions
- 6.11.4 Universal Database, Media and Asset Streaming
- 6.11.5 Universal App State Synchronization and Migration
- 7 Summary
- 8 Conclusion
- 9 References
- 9.1 Abbreviations
- 9.2 Structure
- Citations
Introduction
- Today, there are many ways of accessing a remote resource: File, database or custom API
- Similarly, synchronizing a resource is a problem that is handled on a case-by-case basis
- Using a 3rd party (database, OS service for file synchronization)
- Custom synchronization protocol
- Migration of a resource is also a very complex task
- Moving a resource between two hosts often uses APIs that are really meant for synchronization (i.e. databases)
- Slow and require internet access
- Almost always proprietary to an app
- This is despite a resource almost always being represented by a memory region or file
- What if instead of accessing, synchronizing and migrating a resource with application-specific protocols and abstractions, we could do it universally, for any memory region?
- Problems with the current state
- While there are ways to access, synchronize and migrate memory regions today, they are niche solutions, confined to narrow use cases like virtual machine live migration
- These solutions are also hindered by the lack of a universal API for using them, being made for a specific application instead of a general memory synchronization use case
- Overview of the structure
- This thesis explores different options for implementing such a universal way of accessing, synchronizing and migrating data
- It shows how to implement select approaches with APIs such as userfaultfd and NBD, which challenges exist, which optimizations can be done and what such a universal API and related wire protocols can look like
- It also compares the performance of different configurations to find the optimal use case for each approach, while not being limited to either WAN or LAN deployments
- Finally, it presents a comprehensive implementation of a universal way of accessing, synchronizing and migrating data with various use cases and benchmarking data which shows that it can be used for real-world applications
Technology
User Space and Kernel Space
- Kernel
- Core of the operating system
- Directly interacts with hardware
- Manages system resources (CPU time, memory etc.)
- Enforces security policies
- Responsible for process scheduling, memory management and drivers
- Kernel space is the memory region in which the kernel is stored and executes
- User space
- Portion of system memory where user mode applications execute
- Applications can't directly access hardware or kernel memory, and need to use an API to get access[1]
- Syscalls
- Serve as the API between user space and kernel space
- Make it possible for applications in user space to access low-level OS services
- Common examples are open(), read(), write() and close() (see the short Go sketch below)
- ioctl
- Is a fairly generic syscall
- Allows for device-specific actions that can't be done with or expressed by regular system calls, providing a universal I/O model[2]
- Can be an implementation hurdle for languages because they are defined using numeric constants and arbitrary pointers, making type safety an issue[3]
The Linux Kernel
- Open-source kernel created by Linus Torvalds in 1991
- Written in C, with Rust recently being added as an additional allowed language[4]
- Powers millions of devices worldwide (servers, desktops, phones, embedded devices)
- Is a bridge between applications and hardware
- Provides an abstraction layer
- Compatible with many architectures (ARM, x86, RISC-V etc.)
- Is not an operating system in itself, but is usually made into an operating system by a distribution[5]
- Distributions add userspace tools (e.g. GNU coreutils or BusyBox), desktop environments and more, turning it into a full operating system
- Is a good fit for this thesis due to its open-source nature (allows anyone to view, modify and contribute to the source code)
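As an illustrative aside, the following minimal Go sketch (using the golang.org/x/sys/unix package and a hypothetical file path) performs the open(), read() and close() syscalls mentioned above directly, rather than through the standard library's higher-level file API:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// open(2): returns a raw file descriptor instead of an *os.File
	fd, err := unix.Open("/etc/hostname", unix.O_RDONLY, 0)
	if err != nil {
		panic(err)
	}
	// close(2) once we are done with the descriptor
	defer unix.Close(fd)

	// read(2): fills the buffer and returns the number of bytes read
	buf := make([]byte, 128)
	n, err := unix.Read(fd, buf)
	if err != nil {
		panic(err)
	}

	fmt.Printf("read %d bytes: %q\n", n, buf[:n])
}
```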
Linux Kernel Modules
- The Linux kernel is monolithic, but extensible thanks to kernel modules[5]
- Small pieces of kernel-level code that can be loaded into and unloaded from the kernel
- Can extend the kernel's functionality without reboots
- Are dynamically linked into the running kernel
- Helps keep the kernel size manageable and maintainable
- Kernel modules are written in C
- Interact with the kernel through APIs
- Poorly written modules can cause significant kernel instability
- Modules can be loaded at boot time or dynamically (modprobe, rmmod etc.)[6]
- The module lifecycle can be implemented with initialization and cleanup functions
UNIX Signals and Handlers
- Signals
- Are software interrupts that notify a process of important events (exceptions etc.)
- Can originate from the kernel, user input or other processes
- Function as an asynchronous communication mechanism between processes or between the kernel and a process
- Have default actions, i.e. terminating the process or ignoring a signal[7]
- Handlers
- Can be used to customize how a process should respond to a signal
- Can be installed with sigaction()[8]
- Signals are not designed as an IPC mechanism, since they can only alert of an event, but cannot carry any additional data for it
UNIX Sockets
- Overview
- Allow processes on the same host system to communicate with each other[9]
- Unlike UNIX signals, they can be used for IPC with additional data
- Are also available in Linux
- Types
- Stream sockets provide reliable, two-way, connection-based byte streams (analogous to TCP), making them optimal for applications that need strong consistency
- Datagram sockets allow for connection-less communication with fewer guarantees (analogous to UDP), prioritizing speed
- Named and unnamed
- Named UNIX sockets are a special file type on the file system and can be identified by a path, and allow easy communication between unrelated processes (see the sketch below)
- Unnamed UNIX sockets exist only in memory and disappear after the creating process terminates
- UNIX sockets allow file descriptor passing (see the section on userfaultfd), which allows sharing a file descriptor between processes
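As a minimal sketch of the named-socket IPC described above (the socket path is hypothetical and error handling is reduced to panics), two goroutines stand in for two unrelated processes talking over a stream-type UNIX socket:

```go
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	path := "/tmp/example.sock" // hypothetical socket path
	_ = os.Remove(path)         // remove a stale socket file from earlier runs

	// "Server" side: listen on a named (stream-type) UNIX socket
	ln, err := net.Listen("unix", path)
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	done := make(chan struct{})
	go func() {
		defer close(done)

		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()

		buf := make([]byte, 64)
		n, _ := conn.Read(buf)
		fmt.Printf("received over IPC: %s\n", buf[:n])
	}()

	// "Client" side: any unrelated process could dial the same path
	conn, err := net.Dial("unix", path)
	if err != nil {
		panic(err)
	}
	if _, err := conn.Write([]byte("hello")); err != nil {
		panic(err)
	}
	conn.Close()

	<-done
}
```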
Memory Hierarchy
- Memory in computers can be classified based on size, speed, cost and proximity to the CPU
- Principle of locality: The most frequently accessed data and instructions should be in the closest memory[10]
- Locality is important mostly due to the "speed of the cable": throughput decreases (due to dampening) and latency increases (due to being limited by the speed of light) as distance increases
- Registers
- Closest to the CPU
- Very small amount of storage (32-64 bits of data)
- Used by the CPU to perform operations
- Very high speed, but limited in storage size
- Cache Memory
- Divided into L1, L2 and L3
- The higher the level, the larger and less expensive a layer
- Buffer for frequently accessed data
- Predictive algorithms optimize data usage
- Main Memory (RAM)
- Offers larger capacity than cache but is slower
- Stores programs and open files
- Secondary Storage (SSD/HDD)
- Slower than RAM but can store larger amounts of data
- Typically stores the OS etc.
- Is persistent (keeps data after power is cut)
- Tertiary Storage (optical disks, tape)
- Slow, but very cheap
- Tape: Can store very large amounts of data for relatively long amounts of time
- Typically used for archives or physical data transport (e.g. import from personal infrastructure to AWS)
- Evolution of the hierarchy
- Technological advancements continue to blur this clear hierarchy[11]
- E.g. NVMe rivals RAM speeds but can store larger amounts of data
- This thesis also blurs the hierarchy by exposing e.g. tertiary or secondary storage with the same interface as main memory
Memory Management in Linux
- Memory management is a crucial part of every operating system, maybe even the whole point of an operating system
- Creates a buffer between applications and physical memory
- Can provide security guarantees (e.g. only one process can access its memory)
- Kernel space
- Runs the kernel, kernel extensions, device drivers
- Managed by the kernel memory module
- Uses slab allocation (groups objects of the same size into caches, speeds up memory allocation, reduces fragmentation of the memory)[12]
- User space
- Applications (and some drivers) store their memory here[13]
- Managed through a paging system
- Each application has its own private virtual address space
- The virtual address space is divided into pages of 4 KB
- Pages can be mapped to any "real" location in physical memory
Swap Space
- A portion of the secondary storage is reserved for virtual memory[13]
- Essential for systems running multiple applications
- Moves inactive parts of RAM to secondary storage to free up space for other processes
- Implementation in Linux
- Linux uses a demand paging system: Memory is only allocated when it is needed
- Can be either a swap partition (separate area of the secondary storage) or a swap file (regular file that can be expanded/truncated)
- Swap partitions and files are transparent to use
- The kernel uses an LRU algorithm for deciding which pages to swap out
- Role in hibernation
- Before hibernating, the system saves the contents of RAM into swap (where they are persistent)
- When resuming, memory is read back from swap
- Role in performance
- If swap is used too heavily, since the secondary storage is usually slower than primary memory, it can lead to significant slowdowns
- "Swappiness" can be set for the kernel, which controls how likely the system is to swap memory pages
Page Faults
- Page faults occur when a process tries to access a page not available in primary memory, which causes the OS to swap the required page from secondary storage into primary memory[6]
- Types
- Minor page faults: Page is already in memory, but not linked to the process that needs it
- Major page faults: The page needs to be loaded from secondary storage
- The LRU (and the simpler clock algorithm) can minimize page faults
- Techniques for handling page faults
- Prefetching: Anticipating future page requests and loading them into memory in advance
- Page compression: Compressing inactive pages and storing them in memory pre-emptively (so that fewer major faults happen)[14]
- Usually, handling page faults is something that the kernel does
mmap
- Overview
- UNIX system call for mapping files or devices into memory
- Multiple possible use cases: Shared memory, file I/O, fine-grained memory allocation
- Commonly used in applications like databases
- Is a "power tool", needs to be used carefully and intentionally
- Functionality
- Establishes a direct link (memory mapping) between a file and a memory region[15]
- When the system reads from the mapped memory region, it reads from the file directly and vice versa
- Reduces overhead since no or fewer context switches are needed
- Benefits:
- Enables zero-copy operations: Data can be accessed directly as though it were in memory, without having to copy it from disk first
- Can be used to share memory between processes without having to go through the kernel with syscalls[7]
- Drawback: Bypasses the file system cache, which can lead to stale data if multiple processes read/write at the same time
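The following hedged Go sketch (file name and sizes are arbitrary; golang.org/x/sys/unix is used for the raw calls) illustrates the mapping behavior described above: mapping a file with MAP_SHARED, writing to the mapping through a plain byte slice, and flushing the changes back to the file with msync:

```go
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical backing file; it must be large enough for the mapping
	f, err := os.OpenFile("data.bin", os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	size := 4096
	if err := f.Truncate(int64(size)); err != nil {
		panic(err)
	}

	// Map the file into memory; MAP_SHARED makes writes visible in the backing file
	b, err := unix.Mmap(int(f.Fd()), 0, size, unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(b)

	// Plain slice writes now go to the mapping instead of through write(2)
	copy(b, []byte("hello from the mapping"))

	// Flush dirty pages back to the file, like msync(2) with MS_SYNC
	if err := unix.Msync(b, unix.MS_SYNC); err != nil {
		panic(err)
	}
}
```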
inotify
- Event-driven notification system of the Linux kernel[16]
- Monitors the file system for events (i.e. modifications, access etc.)
- Uses a watch feature for monitoring specific events, e.g. only watching writes (see the watch sketch below)
- Reduces overhead and resource use compared to polling
- Widely used in many applications, e.g. Dropbox for file synchronization
- Has limitations like the limit on how many watches can be established
Linux Kernel Disk and File Caching
- Disk caching
- Temporarily stores frequently accessed data in RAM
- Uses the principle of locality (see Memory Hierarchy)
- Implemented using the page cache subsystem in Linux
- Uses the LRU algorithm to manage cache contents
- File caching
- Linux caches file system metadata in the dentry and inode caches
- Metadata includes i.e. file names, attributes and locations
- This caching accelerates the resolution of path names and file attributes (i.e. the last change date for polling)
- File reads/writes pass through the disk cache
- Complexities
- Data consistency: Between the disk and cache via writebacks. Aggressive writebacks lead to reduced performance, delays risk data loss
- Release of cached data under memory pressure: Cache eviction requires intelligent algorithms, i.e. LRU[6]
LAN and WAN
- RTT
- Round trip time (RTT) represents the time data takes to travel from a source to a destination and back
- Provides an insight into application latency
- Varies according to many factors such as network type, system load and physical distance
- LAN
- A local area network (LAN) is network infrastructure in a geographically small area[17]
- Often the network of a single data center or office
- Typically has a low RTT and thus latency due to a small number of relays and the small distances that need to be travelled
- Perimeter security is often applied, where a LAN is seen as a trusted network that doesn't require authentication between internal systems
- WAN
- Wide area networks (WAN) span large geographical areas
- The most famous example is the open internet
- Due to physical distance and the number of hops, often higher RTT and latency
- Typically open to wiretapping and packet inspection, requiring encryption and authentication for services to operate
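As one possible way to consume inotify events from Go, the following sketch uses the third-party fsnotify package (one common wrapper around inotify; the watched path is hypothetical) to react only to write events instead of polling:

```go
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch a hypothetical directory for changes instead of polling it
	if err := watcher.Add("/tmp/watched"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case event := <-watcher.Events:
			// Only react to writes, similar to registering a write-only watch
			if event.Op&fsnotify.Write == fsnotify.Write {
				log.Println("modified:", event.Name)
			}
		case err := <-watcher.Errors:
			log.Println("watch error:", err)
		}
	}
}
```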
TCP, UDP, TLS and QUIC
- TCP
- Connection-oriented
- Has been the reliable backbone of internet communication
- Guaranteed delivery and maintained data order
- Includes error checking, lost packet retransmission, and congestion control mechanisms
- Powers the majority of the web[18]
- UDP
- Connectionless
- No reliability or ordered packet delivery guarantees
- Faster than TCP due to fewer guarantees
- Suitable for applications that require speed over reliability (i.e. online gaming, video calls etc.)[19]
- TLS
- Encryption protocol that intends to secure communication over a public network like the internet, has been updated many times to evolve with time
- Uses both symmetric and asymmetric encryption[20]
- Is used for most internet communication, esp. in combination with HTTP in the form of HTTPS
- Consists of a handshake phase, in which the parameters necessary to establish a secure connection are exchanged, as well as session keys and certificates, before continuing on to the data transfer phase using the encrypted connection
- Can also be used for authentication through the use of mutual TLS (mTLS), where the client is authenticated by the server
- QUIC
- Modern transport layer protocol developed by Google and standardized by the IETF in 2021
- Intends to combine the best aspects of TCP and UDP
- Provides reliability and ordered delivery guarantees
- Reduces connection establishment times/initial latency by combining the connection and security handshakes
- Avoids head-of-line blocking by allowing independent delivery of separate streams[21]
Delta Synchronization
- Traditionally, when files are synchronized between hosts, the entire file is transferred
- Delta synchronization is a technique that instead intends to transfer only the parts of the file that have changed
- Can lead to reduced network and I/O overhead
- Probably the most popular tool for file synchronization like this is rsync
- When a delta-transfer algorithm is used, it computes the difference between the local and the remote file, and then synchronizes the changes
- The delta sync algorithm first does file block division (see the checksum sketch below)
- The file on the destination is divided into fixed-size blocks
- For each block in the destination, a weak and fast checksum is calculated
- The checksums are sent over to the source
- On the source, the same checksum calculation process is run, and the results are compared against the checksums that were sent over (matching block identification)
- Once the changed blocks are known, the source sends over the offset of each block and the changed block's data to the destination
- When a block is received, the destination writes the chunk to the specified offset, reconstructing the file
- Once one polling interval is done, the process begins again[22]
File Systems In Userspace (FUSE)
- Software interface that allows writing custom file systems in userspace
- Developers can create file systems without having to engage in low-level kernel development
- Available on multiple platforms, mostly Linux but also macOS and FreeBSD
- In order to implement file systems in user space, we can use the FUSE API
- Here, a user space program registers itself with the FUSE kernel module
- This program provides callbacks for the file system operations, e.g. for open, read, write etc.
- When the user performs a file system operation on a mounted FUSE file system, the kernel module will send a request for the operation to the user space program, which can then reply with a response, which the FUSE kernel module will then return to the user
- This makes it much easier to create a file system compared to writing it in the kernel, as it can run in user space
- It is also much safer as no custom kernel module is required and an error in the FUSE or the backend can't crash the entire kernel
- Unlike a file system implemented as a kernel module, this layer of indirection makes the file system portable, since it only needs to communicate with the FUSE module
- Does have significant performance overhead due to the context switching between the kernel and the file system in userspace[23]
- Is used for many high-level interfaces to external services, i.e. to mount S3 buckets or a remote system's disk via SSH
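To make the block-division and matching-block-identification steps of delta synchronization more concrete, here is a simplified Go sketch (function names are hypothetical and CRC32 stands in for the weak checksum; a real implementation like rsync additionally uses a strong hash and a rolling checksum to handle shifted data):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// blockChecksums divides a buffer into fixed-size blocks and computes a weak
// checksum per block (the destination would send these to the source).
func blockChecksums(data []byte, blockSize int) []uint32 {
	sums := []uint32{}
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		sums = append(sums, crc32.ChecksumIEEE(data[off:end]))
	}
	return sums
}

// changedBlocks compares the source's checksums against the destination's and
// returns the offsets of the blocks that have to be re-sent.
func changedBlocks(local, remote []uint32, blockSize int) []int64 {
	offsets := []int64{}
	for i, sum := range local {
		if i >= len(remote) || remote[i] != sum {
			offsets = append(offsets, int64(i*blockSize))
		}
	}
	return offsets
}

func main() {
	oldData := []byte("hello world, this is the destination's copy")
	newData := []byte("hello WORLD, this is the destination's copy")

	remote := blockChecksums(oldData, 8) // computed on the destination
	local := blockChecksums(newData, 8)  // computed on the source

	fmt.Println("changed offsets:", changedBlocks(local, remote, 8))
}
```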
Network Block Device (NBD)
- NBD uses a protocol to communicate between a server (provided by user space) and a client (provided by the NBD kernel module)
- The protocol can run over WAN, but is really mostly designed for LAN or localhost usage
- It has two phases: Handshake and transmission[24]
- There are multiple actors in the protocol: One or multiple clients, the server and the virtual concept of an export
- When the client connects to the server, the server sends a greeting message with the server's flags
- The client responds with its own flags and an export name (a single NBD server can expose multiple devices) to use
- The server sends the export's size and other metadata, after which the client acknowledges the received data and the handshake is complete
- After the handshake, the client and server start exchanging commands and replies
- A command can be any of the basic operations needed to access a block device, e.g. read, write or flush
- Depending on the command, it can also contain data (such as the chunk to be written), offsets, lengths and more
- Replies can contain an error, a success value or data depending on the reply's type
- NBD is however limited in some respects; the maximum message size is 32 MB, but the maximum block/chunk size supported by the kernel is just 4 KB, making it a suboptimal protocol to run over WAN, esp. in high-latency scenarios
- The protocol also allows for listing exports, making it possible to e.g. list multiple memory regions on a single server
- NBD is an older protocol with multiple different handshake versions and legacy features
- Since the purpose of NBD in this use case is minimal and both the server and the client are typically controlled, it makes sense to only implement the latest recommended versions and the baseline feature set
- The baseline feature set requires no TLS, the latest "fixed newstyle" handshake, the ability to list and choose an export, as well as the read, write and disc(onnect) commands and replies
- As such, the protocol is very simple to implement (a sketch of a minimal backend interface follows below)
- With this simplicity however also come some drawbacks: NBD is less suited for use cases where the backing device behaves very differently from a random-access storage device, like for example a tape drive, since it is not possible to work with high-level abstractions such as files or directories
- This is, for the narrow memory synchronization use case, however more of a feature than a bug
- Since it works on the block level, it can't offer shared access to the same file for multiple clients
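Although the full NBD wire protocol is out of scope at this point, the backend behind such a server can be reduced to a handful of methods; the following Go sketch shows an interface resembling the ReadAt/WriteAt/Size/Sync backend described later for go-nbd, together with a trivial in-memory implementation (names are illustrative, not go-nbd's actual types):

```go
package backend

import "io"

// Backend is a sketch of the minimal interface a block device server needs:
// random-access reads and writes, a known size and a way to flush changes.
type Backend interface {
	ReadAt(p []byte, off int64) (n int, err error)
	WriteAt(p []byte, off int64) (n int, err error)
	Size() (int64, error)
	Sync() error
}

// MemoryBackend serves a block device from a plain byte slice, for illustration.
type MemoryBackend struct {
	data []byte
}

func NewMemoryBackend(size int64) *MemoryBackend {
	return &MemoryBackend{data: make([]byte, size)}
}

func (b *MemoryBackend) ReadAt(p []byte, off int64) (int, error) {
	if off >= int64(len(b.data)) {
		return 0, io.EOF
	}
	return copy(p, b.data[off:]), nil
}

func (b *MemoryBackend) WriteAt(p []byte, off int64) (int, error) {
	if off >= int64(len(b.data)) {
		return 0, io.ErrShortWrite
	}
	return copy(b.data[off:], p), nil
}

func (b *MemoryBackend) Size() (int64, error) {
	return int64(len(b.data)), nil
}

func (b *MemoryBackend) Sync() error {
	return nil // nothing to flush for an in-memory backend
}
```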
Virtual Machine Live Migration
Pre-Copy
- While these systems already allow for some optimizations over simply using the NBD protocol over WAN, they still mean that chunks will only be fetched as they are being needed, which means that there still is a guaranteed minimum downtime
- In order to improve on this, a more advanced API (the managed mount API) was created
- A field that tries to optimize for this use case is live migration of VMs
- Live migration refers to moving a virtual machine, its state and connected devices from one host to another with as little downtime as possible
- There are two types of such migration algorithms: pre-copy and post-copy migration
- Pre-copy migration works by copying data from the source to the destination as the VM continues to run (or, in the case of a generic migration, as app/other state continues being written)
- First, the initial state of the VM's memory is copied to the destination
- If, during this push, chunks are being modified, they are marked as dirty
- These dirty chunks are copied over to the destination until the number of remaining chunks is small enough to satisfy a maximum downtime criterion
- Once this is the case, the VM is suspended on the source, and the remaining chunks are synced over to the destination
- Once the transfer is complete, the VM is resumed on the destination (a simplified sketch of this loop follows after the workload analysis section below)
- This process is helpful since the VM is always available in full on either the source or the destination, and it is resilient to a network outage occurring during the synchronization
- If the VM (or app etc.) is however changing too many chunks on the source during the migration, the maximum acceptable downtime criterion might never be reached, and the maximum acceptable downtime is also somewhat limited by the available RTT[25]
Post-Copy
- An alternative to pre-copy migration is post-copy migration
- In this approach, the VM is immediately suspended on the source and moved to the destination with only a minimal set of chunks
- After the VM has been moved to the destination, it is resumed
- If the VM tries to access a chunk on the destination, a page fault is raised, the missing page is fetched from the source, and the VM continues to execute
- The benefit of post-copy migration is that it does not require re-transmitting dirty chunks to the destination before the maximum tolerable downtime is reached
- The big drawback of post-copy migration is that it can result in longer migration times, because the chunks need to be fetched from the network on demand, which is very latency/RTT-sensitive[25]
Workload Analysis
- "Reducing Virtual Machine Live Migration Overhead via Workload Analysis" provides an interesting analysis of options for how this decision of when to migrate can be made[26]
- While being designed mostly for use with virtual machines, it could serve as a basis for other applications or migration scenarios, too
- The proposed method identifies workload cycles of VMs and uses this information to postpone the migration if doing so is beneficial
- This works by analyzing cyclic patterns that can unnecessarily delay a VM's migration, and identifies optimal cycles to migrate VMs in from this information
- For the VM use case, such cycles could for example be the GC of a large application triggering a lot of changes to the VM's memory etc.
- If a migration is proposed, the system checks whether it is currently in a beneficial cycle to migrate, in which case it lets the migration proceed; otherwise, it postpones it until the next cycle
- The algorithm uses a Bayesian classifier to identify a favorable or unfavorable cycle
- Compared to the alternative, which is usually waiting for a significant percentage of the chunks that were not changed before tracking started to be synced first, this can potentially yield a lot of improvements
- The paper has found an improvement of up to 74% in terms of live migration time/downtime and 43% in terms of the amount of data transferred over the network
- While such a system was not implemented for r3map, using r3map with such a system would certainly be possible
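The pre-copy loop described above can be summarized in a few lines of Go; the Source interface and its methods are hypothetical placeholders for whatever dirty-chunk tracking a real implementation provides, and the sketch shares pre-copy's known limitation that a fast-changing workload may never satisfy the downtime criterion:

```go
package migration

// Source is a hypothetical view of the migrating resource: it can report the
// chunks dirtied since the last round, copy a chunk to the destination and
// suspend the VM/app so that no further writes occur.
type Source interface {
	DirtyChunks() []int64
	CopyChunk(off int64) error
	Suspend() error
}

// PreCopy re-sends dirty chunks until the remaining set is small enough to
// satisfy the maximum acceptable downtime, then suspends the workload and
// transfers the rest during the downtime window.
func PreCopy(src Source, maxFinalChunks int) error {
	for {
		dirty := src.DirtyChunks()

		if len(dirty) <= maxFinalChunks {
			// Small enough: suspend, then copy the final set of dirty chunks
			if err := src.Suspend(); err != nil {
				return err
			}

			for _, off := range src.DirtyChunks() {
				if err := src.CopyChunk(off); err != nil {
					return err
				}
			}

			return nil
		}

		// Still too many dirty chunks: copy them while the workload keeps running
		for _, off := range dirty {
			if err := src.CopyChunk(off); err != nil {
				return err
			}
		}
	}
}
```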
Streams and Pipelines
- Fundamental concepts in computer science
- Sequentially process elements
- Allow for the efficient processing of large amounts of data, without having to load everything into memory
- Form the backbone of efficient, modular data processing
- Streams
- Represent a continuous sequence of data
- Can be a source or destination of data (i.e. files, network connections, stdin/stdout etc.)
- Allow processing of data as it becomes available
- Minimized memory consumption
- Especially well suited for long-running processes (where data gets streamed in for an extended time)[27]
- Pipelines
- Series of data processing stages: The output of one stage serves as input to the next[28]
- Stages can often be run in parallel, improving performance due to a higher degree of concurrency
- Example: The instruction pipeline in a CPU, where the stages of instruction execution can be performed in parallel
- Example: UNIX pipes, where the output of a command (e.g. curl) can be piped into another command (e.g. jq) to achieve a larger goal
Go
- Overview
- Is a statically typed, compiled, open-source language by Google launched in 2009[29]
- Known for its simplicity
- Developed to address the unsuitability of traditional languages for modern, distributed systems development
- Built with input from many original UNIX developers, like Rob Pike and Ken Thompson
- Popular particularly in cloud services and network programming
- Goroutines
- Headline feature of Go
- Enable concurrent function execution
- Similar to threads conceptually, but much more scalable, up to millions of Goroutines per program
- Synchronization and data transfer between Goroutines is done through channels, typed, concurrency-safe conduits for data (a small pipeline sketch follows at the end of this section)
gRPC
- Open-source and high-performance RPC framework
- Developed by Google in 2015
- Features
- Uses HTTP/2 as the transport protocol to benefit from header compression and request multiplexing
- Uses protobuf as the IDL and wire format, a high-performance, polyglot mechanism for data serialization (instead of the slower and more verbose JSON of REST APIs)
- Supports unary RPCs, server-streaming RPCs, client-streaming RPCs and bidirectional RPCs
- Has pluggable support for load balancing, tracing, health checking and authentication[30]
- Supports many languages (Go, Rust, JS etc.)
- Hosted by the CNCF
fRPC and Polyglot
- fRPC
- Is an open-source RPC framework released by Loophole Labs in 2022
- proto3-compatible, meaning that it can be used as a drop-in replacement for gRPC that promises better performance characteristics
- Has the ability to stop the RPC system and retrieve an underlying connection, making it possible to re-use connections for different purposes[31]
- Uses frisbee, a messaging framework, to implement the request-response semantics[32]
- Polyglot
- Is the high-performance serialization framework used by fRPC
- Works similarly to protocol buffers, but is much simpler and carries less legacy code
- Is also language-independent, with current implementations for Go, Rust and TypeScript[33]
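A minimal sketch of the goroutine, channel and pipeline concepts from this section: two stages run concurrently and hand their results to the next stage through channels:

```go
package main

import "fmt"

func main() {
	numbers := make(chan int)
	squares := make(chan int)

	// Stage 1: produce values
	go func() {
		defer close(numbers)
		for i := 1; i <= 5; i++ {
			numbers <- i
		}
	}()

	// Stage 2: transform values; both stages run concurrently
	go func() {
		defer close(squares)
		for n := range numbers {
			squares <- n * n
		}
	}()

	// Final stage: consume the results in the main goroutine
	for s := range squares {
		fmt.Println(s)
	}
}
```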
Redis
- In-memory data structure store
- Used as a database, cache and/or message broker
- Created by S. Sanfilippo in 2009
- Differs from other NoSQL databases by supporting various data structures like lists, sets, hashes or bitmaps
- Uses in-memory data storage for maximum speed and efficiency[34]
- Allows for low-latency reads/writes
- While not primarily intended for persistence, it is possible to store data on disk
- Has a non-blocking I/O model and offers near real-time data processing capabilities
- Includes a pub-sub system to be able to function as a message broker[35]
S3 and Minio
- S3
- Object storage service for data-intensive workloads
- Offered by AWS
- Can be globally distributed to allow for fast access times from anywhere on the globe
- Range of storage classes for different requirements
- Includes authentication and authorization
- Exposes an HTTP API for accessing the stored folders and files[36]
- Minio
- Open-source storage server compatible with S3
- Lightweight and simple, written in Go
- Can be hosted on-prem and is open source
- Allows horizontal scalability to store large amounts of data across nodes[37]
Cassandra and ScyllaDB
- Popular wide-column NoSQL databases
- They combine Amazon's Dynamo model and Google's Bigtable model to create a highly available database
- Apache Cassandra
- Highly scalable, eventually consistent
- Can handle large amounts of data across many servers with no single point of failure
- Consistency can be tuned according to needs (eventual to strong)
- Doesn't use master nodes due to its use of a P2P protocol and distributed hash ring design
- Does have high latency under heavy load and requires fairly complex configuration[38]
- ScyllaDB
- Launched in 2015
- Written in C++ and has a shared-nothing architecture, unlike Cassandra, which is written in Java
- Compatible with Cassandra's API and data model[39]
- Designed to overcome Cassandra's limitations, esp. around latency and in particular P99 latency
- Performance improvements were confirmed with various benchmarking studies[40]
Planning
Pull-Based Synchronization With userfaultfd
- An implementation of post-copy migration
- A memory region is created on the destination host
- Reads to the region by the migrated resource trigger a page fault
- When we encounter such a page fault, we want to fetch the relevant offset from the remote
- Typically, page faults are resolved by the kernel
- Here, we want to handle the page faults in user space
- Traditionally, it was possible to use SIGSEGV signal handlers to handle this from the program
- This however is complicated and slow
- Instead we use a new kernel API ("userfaults")
- We register the region to be handled by userfaultfd
- Then we start a handler that fetches the offsets from the remote on a page fault
- This handler is connected to the registered region using a file descriptor
- We can transfer the handler's file descriptor between processes over a socket
Push-Based Synchronization With mmap and Hashing
- As mentioned before, mmap allows mapping a memory region to a file
- We can put the migratable resource into this memory region
- Writes to the region eventually become writes to the backing file
- If we're able to detect these writes and copy them to the destination, we can implement a pre-copy migration system
- mmaped regions still use caching to speed up reads
- Changes to the region can be flushed to the disk with msync
- Usually we could use inotify to detect changes to the file, but inotify doesn't work with mmaped regions
- As a result, we went for a polling-based solution instead
- Polling has drawbacks, which we tried to work around in our implementation
Push-Pull Synchronization with FUSE
- Can serve as the basis for a pre- or post-copy migration
- Similarly to the file-based synchronization, we mmap a file into memory
- The file is stored on a custom file system
- We then catch reads (for a post-copy migration) or writes (for a pre-copy migration)
- Implementing a file system in the kernel is possible but cumbersome, as described before
- With FUSE we can implement the file system in user space, making this much simpler, as we'll show in the implementation
Mounts with NBD
- Another mmap-based solution for pre- and/or post-copy migration
- Instead of mmaping a file, a device is mmaped
- This device is a block device provided by an NBD client
- We can then connect the NBD client to a remote NBD server, which contains the migratable resource
- Any reads from and writes to the region go through the block device and are resolved by the remote NBD server
- Isn't per se a synchronization method on its own, but rather a remote mount
- Unlike FUSE, this means we only need to implement a block device, not a full file system
- The implementation and performance overhead of an NBD server is fairly low, as we'll show in the implementation
Push-Pull Synchronization with Mounts
- Also tracks changes to the memory region of the migratable resource using NBD
- Limitations of the NBD protocol in WAN
- Usually, the NBD server and client don't run on the same system
- NBD was originally designed to be used as a LAN protocol to access a remote hard disk
- As mentioned before, NBD can run over WAN, but is not designed for this
- The biggest problem with running NBD over a public network, even if TLS is enabled, is latency
- Individual chunks would only be fetched to the local system as they are being accessed, adding a guaranteed minimum latency of at least the RTT
- Instead of directly connecting a client to a remote server, we add a layer of indirection, called a mount, that consists of both a client and a server, both running on the local system
- This combines the NBD server and client into a reusable unit
- We then connect the server to the backend with an API that is better suited for WAN usage
- This also makes it possible to implement smart pull/push strategies instead of simply directly writing to/from the network ("managed mounts")
- The mount WAN protocol
- The simplest form of the mount API is the direct mount API
- This API simply swaps out NBD for a transport-independent RPC framework, but does not do additional optimizations
- It has two simple actors: A client and a server, with only the server providing methods to be called (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/services/backend.go#L14-L19)
- The protocol as such is stateless, as there is only a simple remote read/write interface (add state machine and sequence diagram here)
- Chunking
- One additional layer that needs to be implemented however is proper chunking support (a sketch of a chunk-aligning wrapper follows at the end of this section)
- While we can specify a chunk size for the NBD client in the form of a block size, we can only go up to 4 KB chunks
- For scenarios where the RTT between the backend and server is large, it might make sense to use a much larger chunk size for the actual networked transfers
- Many backends also have constraints that prevent them from functioning without a specific chunk size or aligned offsets, such as e.g. tape drives, which require setting a block size and work best when these chunks are multiple MBs instead of KBs
- Even if there are no constraints on chunking on the backend side (e.g. when a file is used as the backend), it might make sense to limit the maximum supported message size between the client and server to prevent DoS attacks by forcing the backend to allocate large chunks of memory to satisfy requests, which requires a chunking system to work
- Server-side vs. client-side chunking
- It is possible to do the chunking in two places: on the mount API's side, and on the (potentially remote) backend's side
- Doing the chunking on the backend's side is usually much faster than on the mount API's side, as writes with lengths smaller than the chunk size mean that the remote chunk needs to be fetched first, significantly increasing the latency, esp. in scenarios with high RTT
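To illustrate client-side chunking, the following hedged Go sketch (types and names are hypothetical, not r3map's API) wraps a remote io.ReaderAt and rounds every request to whole, aligned chunks before forwarding it:

```go
package chunks

import "io"

// ChunkedReaderAt aligns arbitrary reads to a fixed chunk size, so that the
// remote backend only ever sees requests for whole, aligned chunks.
type ChunkedReaderAt struct {
	remote    io.ReaderAt
	chunkSize int64
}

func NewChunkedReaderAt(remote io.ReaderAt, chunkSize int64) *ChunkedReaderAt {
	return &ChunkedReaderAt{remote: remote, chunkSize: chunkSize}
}

func (c *ChunkedReaderAt) ReadAt(p []byte, off int64) (int, error) {
	// Round the requested range down/up to chunk boundaries
	start := (off / c.chunkSize) * c.chunkSize
	end := ((off + int64(len(p)) + c.chunkSize - 1) / c.chunkSize) * c.chunkSize

	// Fetch the whole aligned range from the remote backend
	buf := make([]byte, end-start)
	if _, err := c.remote.ReadAt(buf, start); err != nil && err != io.EOF {
		return 0, err
	}

	// Copy only the originally requested bytes back to the caller
	return copy(p, buf[off-start:]), nil
}
```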
- Combining pre- and post-copy migration with managed mounts
- For the managed mount API, both paradigms were implemented
- The managed mount API is however primarily intended as an API for reading from a remote resource and then syncing changes back to it, not for migrating a resource between two hosts
- The migration API (discussed later) is a version that is better optimized for this use case
- Background pull
- The pre-copy API is implemented in the form of preemptive pulls based on another ReaderAt
- It allows passing in a pull heuristic function, which it uses to determine which chunks should be pulled in which order
- Many applications commonly access certain bits of data first
- If a resource should be available locally as quickly as possible, then using the correct pull heuristic can help a lot
- For example, if the data pulled consists of a header, then using a pull heuristic that pulls these header chunks first can be of help
- If a file system is being synchronized, and the superblocks of the file system are stored in a known pattern, the pull heuristic can be used to fetch these superblocks first
- If a format like MP4, which has an index, is used, then said index can be fetched and accessed first, and during the time needed to parse the index, the remaining data can be pulled in the background
- Background push
- In order to also be able to write back, it needs to have a push system as well
- This push system is started in parallel with the pull system
- It also takes a local and a remote ReadWriterAt
- This integrates with the callbacks supplied by the syncer, which ensures that we don't sync back changes that have been pulled but not modified, only the ones that have been changed locally
Pull-Based Synchronization with Migrations
- Also tracks changes to the memory region of the migratable resource using NBD
- Optimizing mounts for migration scenarios
- We have now implemented a managed mounts API
- This API allows for efficient access to a remote resource through memory
- It is however not well suited for a migration scenario
- For migrations, more optimization is needed to minimize the maximum acceptable downtime
- For the migration, the process is split into two distinct phases
- The same preemptive background pulls and parallelized device/syncer startup can be used, but the push process is dropped
- The two phases allow pulling the majority of the data first, and only finalizing the move later with the remaining data
- This is inspired by the pre-copy approach to VM live migration, but also allows for some of the benefits of the post-copy approach, as we'll see later
- Why is this useful?
- A constraint for the mount-based API that we haven't mentioned before is that it doesn't allow safe concurrent access to a resource by two readers or writers at the same time
- This poses a problem for migration, where the downtime is what should be optimized for, as the VM or app that is writing to the source device would need to be suspended before the transfer could begin
- This adds very significant latency, which is a problem
- The mount API was also designed in such a way as to make it hard to share a resource this way
- The remote backend API, for example, doesn't itself provide a mount to access the underlying data, which further complicates migration by not implementing a migration lifecycle
- The migration protocol (a sketch of the actors' interfaces follows at the end of this section)
- To fix this, the migration API defines two new actors: The seeder and the leecher
- The seeder represents a resource that can be migrated from/a host that exposes a migratable resource
- The leecher represents a client that wants to migrate a resource to itself
- Initially, the protocol starts by running an application with the application's state on the seeder's mount
- When a leecher connects to the seeder, the seeder starts tracking any writes to its mount
- The leecher starts pulling chunks from the seeder to its local backend
- Once the leecher has reached a satisfactory level of locally available chunks, it asks the seeder to finalize, which causes the seeder to stop the remote app, msync/flush the drive, and return the chunks that were changed between when it started tracking and finalizing
- The leecher then marks these chunks as remote, immediately resumes the VM, and queues them to be pulled immediately
- By splitting the migration into these two distinct phases, the overhead of having to start the device on the leecher can be skipped and additional app initialization that doesn't depend on the app's state (e.g. memory allocation, connecting to databases, loading models etc.) can be performed before the application needs to be suspended
- This solution combines both the pre-copy algorithm (by pulling the chunks from the seeder ahead of time) and the post-copy algorithm (by resolving dirty chunks from the seeder after the VM has been migrated) into one coherent protocol (add state machine diagram here)
- This way, the maximum tolerable downtime can be drastically reduced, and dirty chunks don't need to be re-transmitted multiple times
- Effectively, it drops the maximum guaranteed downtime to the time it takes to msync the seeder's app state, the RTT and, if they are being accessed immediately, how long it takes to fetch the chunks that were written between starting to track and finalizing
- The finalization phase
- An interesting question to ask with the two-step migration API is when to start the finalization step
- As is visible from the migration API protocol state machine shown beforehand, the finalization stage is critical and hard or impossible to recover from, depending on the implementation
- For the memory sync on its own, one could just call Finalize multiple times to restart it
- But since Finalize needs to return a list of dirty chunks, it requires the VM or app on the source device to be suspended before Finalize can return
- While not necessarily the case, such a suspend operation is not idempotent (since it might not just be a suspension that is required, but also a shutdown of dependencies etc.)
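To summarize the seeder/leecher roles and the two-phase finalization described above, here is a hypothetical Go interface sketch; it is not r3map's actual API, just a condensed view of the protocol's responsibilities:

```go
package migration

import "context"

// Seeder exposes a migratable resource: the leecher reads chunks from it and,
// once enough are local, asks it to finalize the migration.
type Seeder interface {
	// ReadAt serves chunks to the leecher while the app keeps running
	ReadAt(ctx context.Context, length int, off int64) ([]byte, error)
	// Track starts recording writes to the seeder's mount
	Track(ctx context.Context) error
	// Finalize suspends the app, flushes its state and returns the offsets of
	// the chunks that were dirtied between Track and Finalize
	Finalize(ctx context.Context) (dirtyOffsets []int64, err error)
	// Close releases the resource once the migration has completed
	Close(ctx context.Context) error
}

// Leecher drives the migration: it pulls chunks in the background, decides
// when the locally available share is satisfactory, and resumes the app.
type Leecher interface {
	// Ready reports whether enough chunks are local to call Finalize
	Ready(ctx context.Context) (bool, error)
	// MarkRemote re-queues the dirty chunks returned by Finalize for pulling
	MarkRemote(ctx context.Context, offsets []int64) error
}
```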
Implementation
Userfaults in Go with userfaultfd
- General functionality
- By listening to page faults, we know when a process wants to access a specific piece of memory
- We can use this to then pull the chunk of memory from a remote, map it to the address on which the page fault occurred, thus only fetching data when it is required
- Usually, handling page faults is something that the kernel does
- In our case, we want to handle page faults in userspace, and implement post-copy with them
- In the past, this used to be possible from userspace by handling the SIGSEGV signal in the process
- In our case however, we can use a newer system called userfaultfd to do this in a more elegant way (available since kernel 4.11)
- userfaultfd allows handling these page faults in userspace
- The region that should be handled can be allocated with e.g. mmap
- Once we have the file descriptor for the userfaultfd API, we need to transfer this file descriptor to a process that should respond with the chunks of memory to be put into the faulting address (a sketch of such a transfer follows at the end of this section)
- Once we have received the file descriptor over the socket, we need to register the handler for the API to use
- If the handler receives an address that has faulted, it responds with the UFFDIO_COPY ioctl and a pointer to the chunk of memory that should be used on the file descriptor (code snippet from https://github.com/loopholelabs/userfaultfd-go/blob/master/pkg/mapper/handler.go)
- API design for userfaultfd-go
- Implementing this in Go was quite tricky, and it involves using unsafe
- We can use the syscall and unix packages to interact with ioctl etc.
- We can use the ioctl syscall to get a file descriptor to the userfaultfd API, and then register the API to handle any faults on the region (code snippet from https://github.com/loopholelabs/userfaultfd-go/blob/master/pkg/mapper/register.go#L15)
- Passing file descriptors between processes is possible by using a UNIX socket (code snippet from https://github.com/loopholelabs/userfaultfd-go/blob/master/pkg/transfer/unix.go)
- Implementing userfaultfd backends
- A big benefit of using userfaultfd and the pull method is that we are able to simplify the backend of the entire system down to an io.ReaderAt (code snippet from https://pkg.go.dev/io#ReaderAt)
- That means we can use almost any io.ReaderAt as a backend for a userfaultfd-go registered object
- We know that access will always be aligned to 4 KB chunks/the system page size, so we can assume a chunk size on the server based on that
- For the first example, we can return a random pattern in the backend (code snippet from https://github.com/loopholelabs/userfaultfd-go/blob/master/cmd/userfaultfd-go-example-abc/main.go); this shows a great way of exposing truly arbitrary information as a byte slice without having to pre-compute everything or change the application
- Since a file is a valid io.ReaderAt, we can also use a file as the backend directly, creating a system that essentially allows for mounting a (remote) file into memory (code snippet from https://github.com/loopholelabs/userfaultfd-go/blob/master/cmd/userfaultfd-go-example-file/main.go)
- Similarly, we can use it to map a remote object from S3 into memory, and access only the chunks of it that we actually require (which in the case of S3 is achieved with HTTP range requests) (code snippet from https://github.com/loopholelabs/userfaultfd-go/blob/master/cmd/userfaultfd-go-example-s3/main.go)
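As a sketch of the file descriptor passing mentioned above (similar in spirit to the userfaultfd-go transfer package, but not its actual code), the following Go functions send and receive a descriptor over a connected UNIX socket using an SCM_RIGHTS control message:

```go
package transfer

import (
	"errors"
	"net"

	"golang.org/x/sys/unix"
)

// SendFd transfers a raw file descriptor (e.g. one returned by the userfaultfd
// syscall) to another process over a connected UNIX socket.
func SendFd(conn *net.UnixConn, fd int) error {
	// Encode the descriptor as an SCM_RIGHTS control message
	oob := unix.UnixRights(fd)
	_, _, err := conn.WriteMsgUnix([]byte{0}, oob, nil)
	return err
}

// ReceiveFd reads one SCM_RIGHTS control message and returns the descriptor,
// which then refers to the same kernel object in the receiving process.
func ReceiveFd(conn *net.UnixConn) (int, error) {
	buf := make([]byte, 1)
	oob := make([]byte, unix.CmsgSpace(4)) // room for a single 32-bit descriptor

	_, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
	if err != nil {
		return 0, err
	}

	msgs, err := unix.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return 0, err
	}
	if len(msgs) != 1 {
		return 0, errors.New("expected exactly one control message")
	}

	fds, err := unix.ParseUnixRights(&msgs[0])
	if err != nil || len(fds) != 1 {
		return 0, errors.New("expected exactly one file descriptor")
	}

	return fds[0], nil
}
```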
File-Based Synchronization
- File-based synchronization
- We can do this by using mmap, which allows us to map a file into memory
- By default, mmap doesn't write changes to the mapped memory back to the file, no matter whether the file descriptor passed to it would allow it to or not
- We can however add the MAP_SHARED flag; this tells the kernel to write changes to the memory region back to the corresponding regions of the backing file
- Linux caches reads to such a backing file, so only the first page fault would be answered by fetching from disk, just like with userfaultfd
- The same applies to writes; similar to how files need to be synced in order for them to be written to disk, mmaped regions need to be msynced in order to flush changes to the backing file
- In order to synchronize changes to the region between hosts by syncing the underlying file, we need to have the changes actually be represented in the file, which is why msync is critical
- For files, you can use O_DIRECT to skip this kernel caching if your process already does caching on its own, but this flag is ignored by mmap
- inotify vs. polling
- Usually, one would use inotify to watch changes to a file
- inotify allows applications to register handlers on a file's events, e.g. WRITE or SYNC. This allows for efficient file synchronization, and is used by many file synchronization tools
- It is also possible to filter only the events that we need to sync the writes, making it the perfect choice for this use case
- For technical reasons however (mostly because the file is represented by a memory region), Linux doesn't fire these events for mmaped files, so we can't use it
- There are two next-best options: Either polling for file attribute changes (e.g. last write), or continuously hashing the file to check if it has changed
- Polling on its own has a lot of downsides, such as adding a guaranteed minimum latency by virtue of having to wait for the next polling cycle
- This negatively impacts a maximum allowed downtime scenario, where the overhead of polling can make or break a system
- Hashing the entire file is also a naturally I/O- and CPU-intensive process because the entire file needs to be read at some point
- Still, polling & hashing is probably the only reliable way of detecting changes to a file
- Instead of hashing the entire file, then syncing the entire file, we really want to sync only the parts of the file that have changed between two polling iterations
- Detecting file changes (see the hashing sketch below)
- We can do this by opening up the file multiple times, then hashing individual offsets, and aggregating the chunks that have changed
- When picking algorithms for this hashing process, the most important metrics to consider are the throughput with which hashes can be computed, as well as the chance of collisions
- If the underlying hashing algorithm is CPU-bound, this also allows for better concurrent processing
- Increases the initial latency/overhead by having to open up multiple file descriptors
- But this can not only increase the speed of each individual polling tick, it can also drastically decrease the amount of data that needs to be transferred since only the delta needs to be synchronized
- Hashing and/or syncing individual chunks that have changed is a common practice
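The chunk-wise change detection described above could look roughly like the following Go sketch: the file is hashed in fixed-size chunks by a bounded number of workers, and the resulting CRC32 values can then be compared against the remote side's hashes (names are hypothetical, not the thesis code's actual functions):

```go
package filesync

import (
	"hash/crc32"
	"os"
	"sync"
)

// HashChunks hashes a file in fixed-size chunks with a bounded number of
// workers; each worker opens its own descriptor so reads don't share an offset.
func HashChunks(path string, chunkSize int64, workers int) ([]uint32, error) {
	info, err := os.Stat(path)
	if err != nil {
		return nil, err
	}

	count := (info.Size() + chunkSize - 1) / chunkSize
	hashes := make([]uint32, count)

	var wg sync.WaitGroup
	sem := make(chan struct{}, workers) // semaphore limiting concurrent file access
	errs := make(chan error, count)

	for i := int64(0); i < count; i++ {
		wg.Add(1)
		sem <- struct{}{}

		go func(chunk int64) {
			defer wg.Done()
			defer func() { <-sem }()

			f, err := os.Open(path)
			if err != nil {
				errs <- err
				return
			}
			defer f.Close()

			buf := make([]byte, chunkSize)
			n, err := f.ReadAt(buf, chunk*chunkSize)
			if n == 0 && err != nil {
				errs <- err
				return
			}

			// Writing to distinct indices from different goroutines is safe
			hashes[chunk] = crc32.ChecksumIEEE(buf[:n])
		}(i)
	}

	wg.Wait()
	close(errs)

	if err := <-errs; err != nil {
		return nil, err
	}

	return hashes, nil
}
```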
- Delta synchronization protocol
- We have implemented a simple protocol for this delta synchronization, just like rsync's delta synchronization algorithm (code snippet from https://github.com/pojntfx/networked-linux-memsync/blob/master/cmd/darkmagyk-orchestrator/main.go#L1337-L1411 etc.)
- For this protocol specifically, we send the changed file's name as the first message when starting the synchronization, but a simple multiplexing system could easily be implemented by sending a file ID with each message
- It's intended to be simpler than rsync, and to support a central forwarding hub instead of requiring P2P connectivity between each host
- Defines three actors: Multiplexer, file advertiser, and file receiver
- TODO: Add sequence diagram for the protocol
- Multiplexer hub
- Accepts TLS connections
- Reads the client certificate
- Reads the common name from the certificate, which contains the syncer ID
- Spawns a new goroutine for the syncer ID to handle the communication with the specific syncer
- In the goroutine it reads the type of the peer
- For the src-control peer type (code snippet from https://github.com/pojntfx/networked-linux-memsync/blob/159d4af/cmd/darkmagyk-cloudpoint/main.go#L824-L844)
- Reads a file name from the syncer
- Registers the connection as the one providing the file with this name
- Broadcasts the file as one now being available
- For the dst-control peer type (code snippet from https://github.com/pojntfx/networked-linux-memsync/blob/159d4af/cmd/darkmagyk-cloudpoint/main.go#L845-L880)
- It listens to the broadcasted files from src-control peers
- Relays any received file names to the dst-control peers so that they can subscribe
- Also sends all currently known file names to the peer
- For the dst peer type (code snippet from https://github.com/pojntfx/networked-linux-memsync/blob/159d4af/cmd/darkmagyk-cloudpoint/main.go#L882-L942)
- Decodes a file name from the connection
- Looks for a corresponding src-control peer
- If it has found a peer, it creates and sends a new ID for this connection to the src-control peer
- Waits until the src-control peer has connected to the hub with this ID with a new src-data peer by listening for src-data peer ID broadcasts
- Spawns two new goroutines that copy to/from this newly created synchronization connection and the connection of the dst peer, relaying any information between the two
- For the src-data peer type (code snippet from https://github.com/pojntfx/networked-linux-memsync/blob/159d4af/cmd/darkmagyk-cloudpoint/main.go#L943-L954)
- Decodes the ID for the peer
- Broadcasts the ID, which allows the dst peer to continue
- File advertisement (code snippet from https://github.com/pojntfx/networked-linux-memsync/blob/159d4afe0828452c61c63233aa134fb3c2a45a1a/pkg/filetransfer/src.go#L112-L269)
- Connects to the multiplexer hub
- Registers itself as a src-control type of peer
- Sends the file name to the multiplexer hub
- Starts a loop to handle dst peer types
- As mentioned earlier, dst peer types send an ID
- Waits until the ID is received, then spawns a new goroutine
- Connects to the multiplexer hub again, this time registering itself as a src-data peer
- Opens the file that it advertises
- Sends the ID that it has received back to the multiplexer hub, which connects it to the matching dst peer
- Starts a new loop that starts the file transmission to the dst peer through the multiplexer
- Checks if the context is cancelled to stop the transmission by breaking the loop if requested
- After a file transmission is done, waits for the specified polling interval, then starts a file transmission cycle again
- File receiver loop (code snippet from https://github.com/pojntfx/networked-linux-memsync/blob/159d4afe0828452c61c63233aa134fb3c2a45a1a/pkg/filetransfer/dst.go#L127-L243)
- Connects to the multiplexer hub
- Registers itself as the dst-control type of peer
- Spawns a new goroutine after a file name has been received from the multiplexer hub
- Connects to the multiplexer hub again, this time registering itself as a dst peer
- Opens up the destination file
- Registers itself as the dst peer
- Sends the file name to the multiplexer, causing the multiplexer to look for a peer that advertises the requested file
- Starts the file reception process in a loop, and exits if a callback determines that the file has been synced completely
- File transmission (code snippet from https://github.com/pojntfx/networked-linux-memsync/blob/159d4afe0828452c61c63233aa134fb3c2a45a1a/pkg/filetransfer/src.go#L14-L110)
- Does the actual file transmission part of the delta synchronization
- Receives the remote hashes from the multiplexer hub
- Calculates the local hashes
- Compares local and remote hashes
- Sends the hashes that don't match to the remote via the multiplexer hub
- If the remote sent fewer hashes than there were locally, asks the remote to truncate its file to the size of the local file that is being synchronized
- Loops over each of the chunks that need to be sent, and sends the updated data for the file in order
- Hash calculation (code snippet from https://github.com/pojntfx/networked-linux-memsync/blob/159d4afe0828452c61c63233aa134fb3c2a45a1a/pkg/filetransfer/hashes.go#L17-L73)
- GetHashesForBlocks handles the concurrent calculation of the hashes for a file on both the file transmitter and receiver
- It uses a semaphore to limit the amount of concurrent access to the file that is being hashed
- Acquires a lock on the semaphore for each hash calculation worker
- Each worker goroutine then opens up the file and calculates a CRC32 hash
- To allow for easy transmission of the hashes over the network, it encodes them as hash values
- After all hashes have been calculated, it adds them to the array and returns them
- File reception (code snippet from https://github.com/pojntfx/networked-linux-memsync/blob/159d4afe0828452c61c63233aa134fb3c2a45a1a/pkg/filetransfer/dst.go#L23-L125)
- Is the receiving part of the delta synchronization algorithm
- Starts by calculating hashes for the local copy of the file
- Sends the local hashes to the remote, then waits until it has received the remote's hashes
- If the remote detected that the file needs to be truncated (by sending a negative cutoff value), it truncates the file
- If the remote's file has grown beyond what the local file is since the last synchronization cycle, it extends the file
- It then reads from the remote in order, and copies each chunk to the matching offset in the local file
FUSE Implementation in Go
- Implementing a FUSE in Go means tight integration with libraries
- It makes sense to divide the process into two aspects
- Creating a backend for a file system abstraction API like afero.Fs
- By using a file system abstraction API like afero.Fs, we can separate the FUSE implementation from the actual file system structure, making it unit testable and making it possible to add caching in user space (code snippet from https://github.com/pojntfx/stfs/blob/main/pkg/fs/file.go)
- It is possible to use even very complex and at first view non-compatible backends as a FUSE file system's backend
- For example, STFS used a tape drive as the backend, which is not random access, but instead append-only and linear (https://github.com/pojntfx/stfs/blob/main/pkg/operations/update.go)
- By using an on-disk index and access optimizations, the resulting file system was still performant enough to be used, and supported almost all features required for the average user
- It is possible to use even very complex and at first glance incompatible backends as a FUSE file system’s backend - For example, STFS used a tape drive as the backend, which is not random access, but instead append-only and linear (https://github.com/pojntfx/stfs/blob/main/pkg/operations/update.go) - By using an on-disk index and access optimizations, the resulting file system was still performant enough to be used, and supported almost all features required for the average user - Creating an adapter between the FS abstraction API and the FUSE library’s backend interface - It is possible to map any afero.Fs to a FUSE backend, so it would be possible to switch between different file system backends without having to write FUSE-specific code (code snippet from https://github.com/JakWai01/sile-fystem/blob/main/pkg/filesystem/fs.go) - While this approach is interesting, the required overhead of implementing it (which is known from prior projects like STFS[41] and sile-fystem[42]) and other factors that will be touched upon later led to it not being pursued further NBD with go-nbd - Overview - Due to the lack of pre-existing libraries, a new pure Go NBD library was implemented - This library does not rely on CGo/a pre-existing C library, meaning that a lot of context switching can be skipped - Server - The server is completely in user space, there are no kernel components involved here - The backend interface for go-nbd is very simple and only requires four methods: ReadAt, WriteAt, Size and Sync - The key difference here compared to the way backends were designed in userfaultfd-go is that they can also handle writes - A good example backend that maps well to a block device is the file backend (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/backend/file.go) - go-nbd exposes a Handle function to support multiple users without depending on a specific transport layer (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go) - This means that systems that are peer-to-peer (e.g. WebRTC) and thus don’t provide a TCP-style accept syscall can still be used easily - It also allows for easily hosting NBD and other services on the same TCP socket - Role of exports (https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go#L23-L28) - Role of options (https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go#L30-L35) - To make it easier to parse, the headers and other structured messages are modeled as Go structs where possible (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/protocol/negotiation.go) - Only the “fixed newstyle” handshake is implemented to keep it simple[24] - Server initiates the handshake by sending the handshake header and ignoring client flags (https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go#L56-L68) - The handshake is implemented using a simple for loop, which either returns on error or breaks - The server encodes/decodes messages with the binary package (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go#L73-L76)
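The following is a rough sketch of what encoding and decoding such a fixed-size header with the binary package can look like; the struct layout and values here are illustrative rather than go-nbd’s actual definitions:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// OptionHeader is an illustrative fixed-size header in the spirit of the
// structs exchanged during the "fixed newstyle" negotiation; the real field
// layouts and magic values are defined by the NBD protocol and modeled in
// go-nbd's protocol package.
type OptionHeader struct {
	OptionMagic uint64
	ID          uint32
	Length      uint32
}

func main() {
	// Encode: NBD sends all on-wire integers in big-endian byte order
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.BigEndian, OptionHeader{
		OptionMagic: 0x49484156454F5054, // "IHAVEOPT"
		ID:          7,                  // e.g. the GO option
		Length:      0,
	}); err != nil {
		panic(err)
	}

	// Decode: the peer reads the same fixed-size struct back from the connection
	var hdr OptionHeader
	if err := binary.Read(&buf, binary.BigEndian, &hdr); err != nil {
		panic(err)
	}

	fmt.Printf("option ID: %v, length: %v\n", hdr.ID, hdr.Length)
}
```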
- Client sends options with an ID - NEGOTIATION_ID_OPTION_INFO and NEGOTIATION_ID_OPTION_GO exchange the information about the chosen export (i.e. block size, export size, export name and description etc.), and if GO is specified it continues directly to the transmission phase - If an export isn’t found, the server aborts the connection (https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go#L102-L120) - NEGOTIATION_ID_OPTION_LIST encodes and sends the list of exports to the client - NEGOTIATION_ID_OPTION_ABORT aborts the handshake, causing the server to close the connection - The actual transmission phase is done similarly, by reading headers in a loop, switching on the message type and reading/sending the relevant data/reply - TRANSMISSION_TYPE_REQUEST_READ executes a read request on the backend and sends the relevant chunk to the client (https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go#L331-L351) - TRANSMISSION_TYPE_REQUEST_WRITE reads an offset and chunk from the client, then writes it to the backend; if the read-only option is specified for the server, a permission error is returned to the client instead (https://github.com/pojntfx/go-nbd/blob/main/pkg/server/nbd.go#L353-L368) - TRANSMISSION_TYPE_REQUEST_DISC gracefully disconnects from the server and causes the backend to sync, i.e. to flush its changes to the disk - Client - The NBD client is implemented by using the kernel NBD client and a userspace component - In order to use it, one needs to find a free NBD device allocated by the kernel NBD module first - NBD devices are pre-created by the NBD kernel module and more can be specified with the nbds_max parameter - In order to find a free one, we can either specify it directly, or check whether we can find a NBD device with zero size in sysfs (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/utils/unused.go) - Client is started by calling Connect and a connection and options are provided - Role of options (https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L38-L45) - The connection is set on the kernel’s NBD device (https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L182-L189) - Relevant ioctl numbers depend on the kernel and are extracted using CGo (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/ioctl/negotiation_cgo.go) - The handshake for the NBD client is negotiated in user space by the Go program - Client also only supports the newstyle negotiation and aborts otherwise - Handshake is implemented as a simple for loop, basically the same as for the server but inverted, once again switching on the types - On a NEGOTIATION_TYPE_REPLY_INFO, the client handles the export size with NEGOTIATION_TYPE_INFO_EXPORT, and the name, description and block size with NEGOTIATION_TYPE_INFO_BLOCKSIZE - Client validates that the blocksize is valid (within specified bounds, and a power of two; https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L261-L277) - After the metadata for the export has been fetched in the handshake, the kernel NBD client is configured with these values using ioctls (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L290-L328) - go-nbd can also list exports with List - For this, the newstyle handshake is initiated, but this time NEGOTIATION_ID_OPTION_LIST is provided - The client reads the export names from the server and disconnects (https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L393-L410) - Client Lifecycle - The DO_IT ioctl never returns until it is disconnected, meaning that an external system must be used to detect whether the device is actually ready - Two ways of detecting whether the device is ready: By polling sysfs for the size parameter, or by using udev
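A simplified sketch of the sysfs-polling option (the actual readiness detection lives in go-nbd’s client package and, as described next, can alternatively use udev):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// waitForNonzeroSize polls the sysfs size attribute of a block device (e.g.
// "nbd0") until it becomes non-zero, which indicates that the kernel has
// finished setting up the device.
func waitForNonzeroSize(device string, interval time.Duration) error {
	path := filepath.Join("/sys/block", device, "size")

	for {
		raw, err := os.ReadFile(path)
		if err != nil {
			return err
		}

		if size := strings.TrimSpace(string(raw)); size != "" && size != "0" {
			return nil
		}

		time.Sleep(interval)
	}
}

func main() {
	if err := waitForNonzeroSize("nbd0", 100*time.Millisecond); err != nil {
		panic(err)
	}

	fmt.Println("/dev/nbd0 is ready")
}
```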
- udev manages devices in Linux - When a device becomes available, the kernel sends a udev event, which we can subscribe to and use as a reliable and idiomatic way of waiting for the ready state (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L104C10-L138) - In reality however, polling sysfs directly can be faster than subscribing to the udev event, so we give the user the option to switch between both options (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L140-L178) - Disconnecting the NBD client is also an asynchronous operation - Works by calling three ioctls: TRANSMISSION_IOCTL_CLEAR_QUE, TRANSMISSION_IOCTL_DISCONNECT and TRANSMISSION_IOCTL_CLEAR_SOCK (https://github.com/pojntfx/go-nbd/blob/main/pkg/client/nbd.go#L334-L359) - Calling these ioctls causes DO_IT to exit, thus terminating the call to Connect - Optimizations for the NBD client implementation - When opening the block device that the client has connected to, usually the kernel does provide a caching mechanism and thus requires sync to flush changes - As mentioned earlier, by using O_DIRECT however, it is possible to skip the kernel caching layer and write all changes directly to the NBD client/server - This is particularly useful if both the client and server are on the local system, and if the amount of time spent on syncing should be as small as possible - It does however require reads and writes on the device node to be aligned to the system’s page size, which is possible to implement with a client-side chunking system but does require application-specific code - Combining the NBD client and server - The server and client are connected by creating a connected UNIX socket pair (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/mount/path_direct.go#L59-L62) - This exposes a path mount, which essentially returns the path to the NBD device for further abstraction layers - By building on this basic direct mount, we can add a file mount API (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/mount/file_direct.go) - The file mount API allows accessing the remote resource with the common read/write syscalls, like a local file would be accessed - The lifecycle is integrated with Open and Close for the device with both mounts, preventing potential synchronization pitfalls over just manually implementing it in each application - The slice API is another mount API (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/mount/slice_direct.go) that mmaps the NBD device directly - Using the mmap/slice approach has a few benefits - First, it makes it possible to use the byte slice directly as though it were a byte slice allocated by make, except it’s transparently mapped to the (remote) backend - mmap/the byte slice also swaps out the syscall-based file interface with a truly random access one, which allows for faster concurrent reads from the underlying backend - Also has the benefit of supporting applications that aren’t built for streaming/files, but depend on storing/fetching data in a byte slice
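A minimal sketch of the mmap/slice idea, assuming an already-connected NBD device at /dev/nbd0 (r3map wraps this logic, together with the lifecycle handling, in its slice mount API):

```go
package main

import (
	"fmt"
	"io"
	"os"
	"syscall"
)

func main() {
	// Open the NBD device exposed by the path mount; /dev/nbd0 is just an
	// example device node here
	f, err := os.OpenFile("/dev/nbd0", os.O_RDWR, 0)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Determine the device size so that the whole device can be mapped
	size, err := f.Seek(0, io.SeekEnd)
	if err != nil {
		panic(err)
	}

	// Map the device into the process' address space; the resulting byte
	// slice is transparently backed by the (remote) backend
	b, err := syscall.Mmap(
		int(f.Fd()),
		0,
		int(size),
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_SHARED,
	)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(b)

	// The slice can now be used like any other []byte
	copy(b, []byte("Hello, world!"))

	fmt.Println(string(b[:13]))
}
```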
- Alternatively, it would also be possible to format the server’s backend or the block device with a file system using standard file system tools - When the device then becomes ready, it can be mounted to a directory on the system - This way it is possible to mmap one or multiple files on the mounted file system instead of mmaping the block device directly - This allows for handling multiple remote regions using a single server, and thus saving on initialization time and overhead - Using a proper file system however does introduce both storage overhead and complexity, which is why e.g. the FUSE approach was not chosen Mounts - The ReadWriterAt pipeline - In order to implement the chunking system, we can use an abstraction layer that allows us to create a pipeline of readers/writers - the ReadWriterAt, combining an io.ReaderAt and an io.WriterAt - This way, we can forward the Size and Sync syscalls directly to the underlying backend, but wrap a backend’s ReadAt and WriteAt methods in a pipeline of other ReadWriterAts - One such ReadWriterAt is the ArbitraryReadWriterAt (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/chunks/arbitrary_rwat.go) - It allows breaking down a larger data stream into smaller chunks - In ReadAt, it calculates the index of the chunk that the offset falls into and the position within that chunk - It then reads the entire chunk from the backend into a buffer, copies the necessary portion of the buffer into the input slice, and repeats the process until all requested data is read - Similarly for the writer, it calculates the chunk’s index and offset - If an entire chunk is being written to, it bypasses the chunking system and writes it directly, so as not to unnecessarily copy the data twice - If only parts of a chunk need to be written, it first reads the complete chunk into a buffer, modifies the buffer with the data that has changed, and writes the entire chunk back, until all data has been written - This simple implementation can be used to allow for writing data of arbitrary length at arbitrary offsets, even if the backend only supports fixed, chunk-aligned reads and writes - In addition to this chunking system, there is also a ChunkedReadWriterAt, which ensures that the limits concerning the maximum size supported by the backend and the actual chunks are being respected - This is particularly useful when the client is expected to do the chunking, and the server simply checks that the chunking system’s chunk size is respected - In order to check if a read or write is valid, it checks whether it is done at an offset that is a multiple of the chunk size, and whether the length of the slice of data to read/write is the chunk size (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/chunks/chunked_rwat.go)
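The chunk arithmetic that such a layer performs can be illustrated with the following simplified sketch (readAligned is a hypothetical helper, not r3map’s ArbitraryReadWriterAt):

```go
package main

import "fmt"

// readAligned serves an arbitrary offset/length read from a backend that only
// supports whole, chunk-aligned reads: it derives the index of the backing
// chunk and the offset within that chunk, then copies the requested portion.
func readAligned(backend func(chunk int64) []byte, chunkSize, off int64, p []byte) {
	for read := int64(0); read < int64(len(p)); {
		chunk := (off + read) / chunkSize   // index of the chunk the offset falls into
		inChunk := (off + read) % chunkSize // position within that chunk

		// Read the whole chunk, then copy only the part that was requested
		n := copy(p[read:], backend(chunk)[inChunk:])

		read += int64(n)
	}
}

func main() {
	chunkSize := int64(4)

	// A toy backend in which chunk i contains the 4 bytes "000i"
	backend := func(chunk int64) []byte {
		return []byte(fmt.Sprintf("%04d", chunk))
	}

	buf := make([]byte, 6)
	readAligned(backend, chunkSize, 3, buf) // spans chunks 0, 1 and 2

	fmt.Println(string(buf)) // "000010": byte 3 of chunk 0, all of chunk 1, byte 0 of chunk 2
}
```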
- Background pull - The Puller component asynchronously pulls chunks in the background (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/chunks/puller.go) - After sorting the chunks, the puller starts a fixed number of workers in the background, each of which asks for a chunk to pull - Note that the puller itself does not copy to/from a destination; this use case is handled by a separate component - It simply reads from the provided ReaderAt, which is then expected to handle the actual copying on its own - The actual copy logic is provided by the SyncedReadWriterAt instead - This component takes both a remote reader and a local ReadWriterAt - If a chunk is read, e.g. by the puller component calling ReadAt, it is tracked and marked as remote by adding it to a local map - The chunk is then read from the remote reader and written to the local ReadWriterAt, and is then marked as locally available, so that on a second read it is fetched directly from the local copy - A callback is then called which can be used to monitor the pull process - Note that if it is used in a pipeline with the Puller, this also means that a chunk which hasn’t been fetched asynchronously yet will be scheduled to be pulled immediately when it is accessed - WriteAt also starts by tracking a chunk, but then immediately marks the chunk as available locally, no matter whether it has been pulled before - The combination of the SyncedReadWriterAt and the Puller component implements the pull post-copy system in a modular and testable way - Unlike the usual way of only fetching chunks when they are accessed however, this system also allows fetching them pre-emptively, gaining some benefits of pre-copy migration, too - Using this Puller interface, it is possible to implement a read-only managed mount - This is very similar to the rr+ prefetching mechanism from “Remote Regions” (reference atc18-aguilera) - Background push - Once opened, the pusher starts a new goroutine in the background which calls Sync at a set recurring interval (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/chunks/pusher.go) - Once Sync is called by this background worker system or manually, it launches workers in the background - These workers all wait for a chunk to handle - Once a worker receives a chunk, it reads it from the local ReadWriterAt, and copies it to the remote - In order to actually mark chunks that are pushable (chunks that have been written to or have already been synced from the remote), the progress callback is used to call the pusher’s MarkOffsetPushable method (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/mount/path_managed.go#L171C24-L185) - Only these marked chunks are considered to be pushed when a write occurs, so as to prevent syncing chunks that have not yet been made locally available - The pusher also serves as a step in the ReadWriterAt pipeline - In order to do this, it exposes ReadAt and WriteAt - ReadAt is a simple proxy, while WriteAt also marks a chunk as pushable (since it mutates data) before writing to the local ReadWriterAt - For a read-only scenario, the Pusher step is simply skipped (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/mount/path_managed.go#L142-L169)
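A simplified sketch of the background push idea, combining a periodic Sync trigger with a bounded worker pool (startBackgroundPush, pushable and pushChunk are hypothetical helpers, not r3map’s Pusher API):

```go
package pusher

import (
	"log"
	"sync"
	"time"
)

// startBackgroundPush periodically pushes all chunks that have been marked as
// pushable, using a bounded pool of workers per sync cycle. pushable returns
// the currently pushable offsets, pushChunk reads a chunk from the local
// ReadWriterAt and copies it to the remote one. Closing the returned channel
// stops the background worker.
func startBackgroundPush(
	interval time.Duration,
	workers int,
	pushable func() []int64,
	pushChunk func(off int64) error,
) chan struct{} {
	done := make(chan struct{})

	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()

		for {
			select {
			case <-done:
				return
			case <-t.C:
				offsets := make(chan int64)

				var wg sync.WaitGroup
				for i := 0; i < workers; i++ {
					wg.Add(1)
					go func() {
						defer wg.Done()

						// Each worker waits for a chunk to handle and pushes it
						for off := range offsets {
							if err := pushChunk(off); err != nil {
								log.Println("could not push chunk:", err)
							}
						}
					}()
				}

				for _, off := range pushable() {
					offsets <- off
				}
				close(offsets)

				wg.Wait()
			}
		}
	}()

	return done
}
```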
- Benefits of managed mounts over direct mounts - For the direct mount, the NBD server was directly connected to the remote, while in this setup a pipeline of pullers, pushers, a syncer and an ArbitraryReadWriterAt is used (graphic of the four systems and how they are connected to each other vs. how the direct mounts work) - Using this simple interface also makes the entire system very testable - In the tests, a memory reader or file can be used as the local or remote ReadWriterAts and a simple table-driven test can be created (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/chunks/puller_test.go) - Thanks to making these individual components (background pull, push, chunking, chunk validation) unit-testable on their own, edge cases (like different pull heuristics) can be tested easily, too - Parallelizing NBD device initialization - The background push/pull system allows pulling from the remote ReadWriterAt before the NBD device is open - This is possible because the device doesn’t need to start accessing the data before it can start pulling - As a result we can start pulling in the background while the NBD client and server are still starting (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/mount/path_managed.go#L251-L272) - These two components typically start fairly quickly, but can still take multiple ms - Often, it takes as long as one RTT, so parallelizing this startup process can significantly reduce the initial latency and pre-emptively pull quite a bit of data - If no background pulls are enabled, the creation of the Puller is simply skipped (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/mount/path_managed.go#L187-L222) - Implementing device lifecycles - Similarly to how the direct mounts API used the basic path mount to build the file and mmap interfaces, the managed mount builds on this interface in order to provide the same interfaces - It is however a bit more complicated to get the lifecycle to work correctly - For example, in order to allow for a Sync() API, e.g. the msync on the mmaped file must happen before Sync() is called on the syncer - This is done through a hooks system (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/mount/file_managed.go#L34-L37) - The same hooks system is also used to implement the correct lifecycle when closing the mount - With all of these components in place, the managed mounts API serves as a fast and efficient option to access almost any remote resource in memory - Optimization for WAN - Another optimization that has been made to support this WAN deployment scenario is the pull-only architecture - Usually, a pre-copy system pushes changes to the destination in the migration API - This however makes such a system hard to use in a scenario where NATs exist, or a scenario in which the network might have an outage during the migration - With a pull-only system emulating the pre-copy setup, the client can simply keep track of which chunks it still needs to pull itself, so if there is a network outage, it can just resume pulling like before, which would be much harder to implement with a push system as the server would have to track this state for multiple clients and handle the lifecycle there - The pull-only system also means that unlike the push system that was implemented for the hash-based synchronization, a central forwarding hub is not necessary - In the push-based system for the hash-based solution, the topology had to be static/all destinations would have to have received the changes made to the remote app’s memory since it was started - In order to cut down on unnecessary duplicate data transmissions, a central forwarding hub was implemented - This central forwarding hub does however add additional latency, which can be removed completely with the migration protocol’s P2P,
pull-only algorithm Live Migration - Overview - As mentioned in Pull-Based Synchronization with Migrations before, the mount API is not optimal for a migration scenario - Splitting the migration into two separate phases can help a lot to fix the biggest problem, the maximum guaranteed downtime - The flexible architecture of the ReadWriterAt components allow the reuse of lots of code for both the mount API and the migration API - Implementing the seeder - The seeder defines a simple read-only RPC API with the familiar ReadAt methods, but also new APIs such as returning dirty chunks from Sync and adding a Track method (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/services/seeder.go#L15-L21) - Unlike the remote backend, a seeder also exposes a mount through a path, file or byte slice, so that as the migration is happening, the underlying data can still be accessed by the application - This fixes the issue that the mount API had for migrations, where only one end of the API exposed it’s mount, while the other part simply served data - The tracking aspect is implemented in the same modular and composable way as the syncer etc. - by using a TrackingReadWriterAt that is connected to the seeder’s ReadWriterAt pipeline - Once activated by Track, the tracker intercepts all WriteAt calls and adds them to a local de-duplicated store (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/chunks/tracking_rwat.go#L28-L40) - When Sync is called, the changed chunks are returned and the de-duplicated store is cleared - Insert graphic of the pipeline here - A benefit of the protocol being defined in such a way that only the client ever calls an RPC, not the other way around, is that the transport layer/RPC system is completely interchangeable - This works by returning a simple abstract service utility struct struct from Open (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/services/seeder.go) - This abstract struct can then be used as the backend for any RPC framework, e.g. for gRPC (code snippet https://github.com/pojntfx/r3map/blob/main/pkg/services/seeder_grpc.go) - Implementing the leecher - The leecher then takes this abstract API as an argument - As soon as the leecher is opened, it calls track on the seeder, and in parallel starts the device (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/migration/path_leecher.go#L144-L203) - The leecher introduces a new component, the LockableReadWriterAt, into its internal pipeline (add graphic of the internal pipeline) - This component simply blocks all read and write operations to/from the NBD device until Finalize has been called (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/chunks/lockable_rwat.go#L19-L37) - This is required because otherwise stale data (since Finalize did not yet mark the changed chunks) could have poisoned the cache on the e.g. 
mmaped device - Once the leecher has started the device, it sets up a syncer (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/migration/path_leecher.go#L214-L252) - A callback can be used to monitor the pull process (code snippet from https://github.com/pojntfx/r3map/blob/main/cmd/r3map-migration-benchmark/main.go#L544-L548) - As described before, after a satisfactory local availability level has been reached, Finalize can be called - Finalize then calls Sync() on the remote, marks the changed chunks as remote, and schedules them also be pulled in the background (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/migration/path_leecher.go#L257-L280) - As an additional measure aside from the lockable ReadWriterAt to make accessing the path/file/slice too early harder, only Finalize returns the managed object, so that the happy path can less easily lead to deadlocks - After a leecher has successfully reached 100% local availability, it calls Close on the seeder and disconnects the leecher from the seeder, causing both to shut down (code snippet from https://github.com/pojntfx/r3map/blob/main/cmd/r3map-migration-benchmark-server/main.go#L137) - Once the leecher has exited, a seeder can be started, to allow for migrating from the destination to another destination again Pluggable Encryption and Authentication - Compared to existing remote mount and migration solutions, r3map is a bit special - As mentioned before, most systems are designed for scenarios where such resources are accessible in a high-bandwidth, low-latency LAN - This means that some assumptions concerning security, authentication, authorization and scalability were made that can not be made here - For example encryption; while for a LAN deployment scenario it is probably assumed that there are no bad actors in the subnet, the same can not be said for WAN - While depending on e.g. TLS etc. for the migration could have been an option, r3map should still be useful for LAN migration use cases, too, which is why it was made to be completely transport-agnostic - This makes adding encryption very simple - E.g. 
for LAN, the same assumptions that are being made in existing systems can be made, and fast latency-sensitive protocols like the SCSI RDMA protocol (SRP) or a bespoke protocol can be used - For WAN, a standard internet protocol like TLS over TCP or QUIC can be used instead, which will allow for migration over the public internet, too - For RPC frameworks with exchangeable transport layers such as dudirekta (will be explained later), this also allows for unique migration or mount scenarios in NATed environments over WebRTC data channels, which would be very hard to implement with more traditional setups - Similarly so, authentication and authorization can be implemented in many ways - While for migration in LAN, the typical approach of simply trusting the local subnet can be used, for public deployments mTLS certificates or even higher-level protocols such as OIDC can be used depending on the transport layer chosen - For WAN specifically, new protocols such as QUIC allow tight integration with TLS for authentication and encryption - While less relevant for the migration use case (since connections can be established ahead of time), for the mount use case the initial remote ReadAt requests’ latency is an important metric since it strongly correlates with the total latency - QUIC has a way to establish 0-RTT TLS, which can save one or multiple RTTs and thus significantly reduce this overhead, and handle authentication in the same step Optimizing Backends For High RTT - In WAN, where latency is high, the ability to fetch chunks concurrently is very important - Without concurrent background pulls, latency adds up very quickly as every memory request would have at least the RTT as latency - The first prerequisite for supporting this is that the remote backend has to be able to read from multiple regions without locking the backend globally - For the file backend for example, this is not the case, as the lock needs to be acquired for the entire file before an offset can be accessed (code snippet from https://github.com/pojntfx/go-nbd/blob/main/pkg/backend/file.go#L17-L25) - For high-latency scenarios, this can quickly become a bottleneck - While there are many ways to solve this, one is to use the directory backend - Instead of using just one backing file, the directory backend is a chunked backend that uses a directory with one file for each chunk instead of a global file - This means that the directory backend can lock each file individually, speeding up concurrent access - This also applies to writes, where even concurrent writes to different chunks can be done at the same time as they are all backed by a separate file - The directory backend keeps track of these chunks by using an internal map of locks (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/backend/directory.go#L22-L24) - When a chunk is first accessed, a new file is created for the chunk (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/backend/directory.go#L77-L94) - If the chunk is being read, the file is also truncated to one chunk length - Since this could easily exhaust the maximum allowed number of file descriptors for a process, a check is added (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/backend/directory.go#L55-L75) - If the maximum allowed number of open files is exceeded, the first file is closed and removed from the map, causing it to be reopened on a subsequent read
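The per-chunk locking idea behind the directory backend can be sketched as follows (simplified and illustrative only, e.g. without the file descriptor limit handling or the write path):

```go
package backend

import (
	"os"
	"path/filepath"
	"strconv"
	"sync"
)

// DirectoryBackend sketches a chunked backend in which every chunk lives in
// its own file, so reads of different chunks never contend on the same lock;
// it assumes chunk-aligned, chunk-sized reads.
type DirectoryBackend struct {
	dir       string
	chunkSize int64

	locksLock sync.Mutex
	locks     map[int64]*sync.Mutex
}

// lockFor returns (and lazily creates) the lock guarding a single chunk
func (b *DirectoryBackend) lockFor(chunk int64) *sync.Mutex {
	b.locksLock.Lock()
	defer b.locksLock.Unlock()

	if b.locks == nil {
		b.locks = map[int64]*sync.Mutex{}
	}

	if _, ok := b.locks[chunk]; !ok {
		b.locks[chunk] = &sync.Mutex{}
	}

	return b.locks[chunk]
}

func (b *DirectoryBackend) ReadAt(p []byte, off int64) (int, error) {
	chunk := off / b.chunkSize

	// Only this chunk's file is locked; other chunks stay accessible concurrently
	lock := b.lockFor(chunk)
	lock.Lock()
	defer lock.Unlock()

	f, err := os.OpenFile(filepath.Join(b.dir, strconv.FormatInt(chunk, 10)), os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	// Make sure the backing file spans a whole chunk before reading from it
	if err := f.Truncate(b.chunkSize); err != nil {
		return 0, err
	}

	return f.ReadAt(p, off%b.chunkSize)
}
```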
Using Remote Stores as Backends - Using key-value stores as ephemeral mounts - RPC backends provide a way to access a remote backend - This is useful, esp. if the remote resource should be protected in some way or if it requires some kind of authorization - Depending on the use case however, esp. for the mount API, having access to a remote backend without this level of indirection can be useful - Fundamentally, a mount maps fairly well to a remote random-access storage device - Many existing protocols and systems provide a way to access essentially this concept over a network - One of these is Redis, an in-memory key-value store with network access - Chunk offsets can be mapped to keys, and bytes are a valid value type, so the chunk itself can be stored directly in the KV store (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/backend/redis.go#L36-L63) - Using Redis is particularly useful because it is able to handle the locking server-side in a very efficient way, and is well-tested for high-throughput etc. scenarios - Redis also has very fast read/write speeds due to its bespoke protocol and fast serialization - Authentication can also be handled using the Redis protocol, and so can multi-tenancy by using multiple databases or a prefix - Mapping large resources into memory with S3 - While the Redis backend is very useful for read-/write usage, when deployment to the public internet is required, it might not be the best one - The S3 backend is a good choice for mapping public information, e.g. media assets, binaries, large read-only filesystems etc. into memory - S3 started out as an object storage service from AWS, but has since become a more or less standard way for accessing blobs thanks to open-source S3 implementations such as Minio - Similarly to how the directory backend uses one file per chunk, one S3 object per chunk is used to store them - S3 is based on HTTP, and like the Redis backend requires chunking due to it not supporting updates of only part of an object - In order to prevent having to store empty data, the backend interprets “not found” errors as empty chunks - Document databases as persistent mount remotes - Another backend option is a NoSQL server such as Cassandra - Specifically, as noted earlier, ScyllaDB - which improves upon Cassandra’s latency, which is important for the mount API - This is more of a proof of concept than a real use case, but it shows the versatility and flexibility of how a database can be mapped to a memory region, which can be interesting for accessing e.g. a remote database’s content without having to use a specific client - ReadAt and WriteAt are implemented using Cassandra’s query language (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/backend/cassandra.go#L36-L63) - Similarly to Redis, locking individual keys can be handled by the DB server - But in the case of Cassandra, the DB server stores chunks on disk, so it can be used for persistent data - In order to use Cassandra, migrations have to be applied for creating the table etc.
(code snippet from https://github.com/pojntfx/r3map/blob/main/cmd/r3map-direct-mount-benchmark/main.go#L369-L396) Bi-Directional and Concurrent RPCs with Dudirekta - Overview of the framework and why a custom one was implemented - Designed specifically with the hybrid pre- and post-copy scenario in mind - Support for concurrent RPCs allows for efficient background pulls since multiple chunks can be pulled at the same time - Bi-directional API makes it possible to initiate pre-copy migrations and transfer chunks from the source host without having to make the destination host dialable - Dudirekta supports defining functions on both the client and the server - This is very useful for implementing e.g. a pre-copy protocol where the source pushes chunks to the destination by simply calling a RPC on the destination, instead of the destination calling a RPC on the source - Usually, RPCs don’t support exposing or calling RPCs on the client, too, only on the server - This would mean that in order to implement a pre-copy protocol with pushes, the destination would have to be dialable from the source - In a LAN scenario, this is easy to implement, but in WAN it is complicated and requires authentication of both the client and the server - Transport independence allows for i.e. QUIC to be used for the 0-RTT handshake in WAN - Support for passing callbacks and closures to RPCs makes it possible to model remote generators and yields to easily report i.e. remote migration progress - Because it reflection-based, both RPC definition and calling an RPC are both transparent, making it optimal for prototyping r3map - Usage (as described in Dudirekta README) - Protocol definition (JSONL, structure etc.) and how it is multiplexed - RPC provider implementation - If an RPC, such as ReadAt, is called, it is looked up via reflection and validated (code snippet from https://github.com/pojntfx/dudirekta/blob/main/pkg/rpc/registry.go#L323-L357) - The arguments, which have been supplied as JSON, are then unmarshalled into their native types, and the local wrapper struct’s method is called in a new goroutine (code snippet from https://github.com/pojntfx/dudirekta/blob/main/pkg/rpc/registry.go#L417-L521) - RPC call implementation - To use a RPC backend on the destination side, the wrapper struct’s remote representation is used (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/services/backend.go#L14-L19) - For the destination site, the remote representation’s fields are iterated over, and replaced by functions which marshal and unmarshal the function calls into the dudirekta JSON protocol (code snippet from https://github.com/pojntfx/dudirekta/blob/main/pkg/rpc/registry.go#L228-L269) - To do this, the arguments are marshalled into JSON, and a unique call ID is generated (code snippet from https://github.com/pojntfx/dudirekta/blob/main/pkg/rpc/registry.go#L109-L130) - Once the remote has responded with a message containing the unique call ID, it unmarshalls the arguments, and returns (code snippet from https://github.com/pojntfx/dudirekta/blob/main/pkg/rpc/registry.go#L145-L217) - Closure implementation - Using Dudirekta for r3map - Since dudirekta has a few limitations (such as the fact that slices are passed as copies, not references, and that context needs to be provided), the resulting remote struct can’t be used directly - To work around this, the standard go-nbd backend interface is implemented for the remote representation, creating a universally reusable, generic RPC backend wrapper (code snippet 
from https://github.com/pojntfx/r3map/blob/main/pkg/backend/rpc.go) - The same backend implementation is also done for the seeder protocol (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/services/seeder.go) Connection Pooling with gRPC - Drawbacks of Dudirekta - While the dudirekta RPC serves as a good reference implementation of the basic RPC protocol, it does not scale particularly well - This mostly stems from two aspects of how it is designed - JSON(L) is used for the wire format, which while simple and easy to analyze, is slow to marshal and unmarshal - Dudirekta’s bi-directional RPCs do however come at the cost of not being able to do connection pooling, since each client dialing the server would mean that the server could not reference the multiple client connections as one composite client without changes to the protocol - While implementing such a pooling mechanism in the future could be interesting, it turned out to not be necessary thanks to the pull-based pre-copy solution described earlier - Instead, only calling RPCs exposed on the server from the client is the only requirement for an RPC framework, and other, more optimized RPC frameworks can already offer this - Switching to code generation - Dudirekta uses reflection to make the RPCs essentially almost transparent to use - By switching to a well-defined protocol with a DSL instead, we can gain further benefits from not having to use reflection and generating code instead - One popular such framework is gRPC - gRPC is a high-performance RPC framework based on Protocol Buffers - Because it is based on Protobuf, we can define the protocol itself in the proto3 DSL - The DSL also allows to specify the order of each field in the resulting wire protocol, making it possible to evolve the protocol over time without having to break backwards compatibility - This is very important given that r3map could be ported to a language with less overhead in the future, e.g. Rust, and being able to re-use the existing wire protocol would make this much easier - Benefits of gRPC over Dudirekta - While dudirekta is a simple protocol that is easy to adapt for other languages, currently it only supports Go and JS, while gRPC supports many more - A fairly unique feature of gRPC are streaming RPCs, where a stream of requests can be sent to/from the server/client, which, while not used for the r3map protocol, could be very useful to implementing a pre-copy migration API with pushes similarly to how dudirekta does it by exposing RPCs from the client - As mentioned before, for WAN migration or mount scenarios, things like authentication, authorization and encryption are important, which gRPC is well-suited for - Protobuf being a proper byte-, not plaintext-based wire format is also very helpful, since it means that e.g. sending bytes back from ReadAt RPCs doesn’t require any encoding (wereas JSON, used by dudirekta, base64 encodes these chunks) - gRPC is also based on HTTP/2, which means that it can benefit from existing load balancing tooling etc. 
that is popular in WAN for web uses even today - Implementation of the gRPC backend - This backend is implemented by first defining the protocol in the DSL (code snippet from https://github.com/pojntfx/r3map/blob/main/api/proto/migration/v1/seeder.proto) - After generating the bindings, the generated backend interface is implemented by using the dudirekta wrapper struct as the abstraction layer (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/services/seeder_grpc.go) - Unlike dudirekta, gRPC also implements concurrent RPCs and connection pooling - Similarly to how having a backend that allows concurrent reads/writes can be useful to speed up the concurrent push/pull steps, having a protocol that allows for concurrent RPCs can do the same - Connection pooling is another aspect that can help with this - Instead of either using one single connection with a multiplexer (which is possible because it uses HTTP/2) to allow for multiple requests, or creating a new connection for every request, gRPC is able to intelligently re-use existing connections for RPCs or create new ones, speeding up parallel requests Optimizing RPC Throughput and Latency with fRPC - Despite these benefits, gRPC is not perfect however - Protobuf specifically, while being faster than JSON, is not the fastest serialization framework that could be used - This is especially true for large chunks of data, and becomes a real bottleneck if the connection between source and destination would allow for a high throughput - This is where fRPC, a RPC library that is easy to replace gRPC with, becomes useful - Because throughput and latency determine the maximum acceptable downtime of a migration/the initial latency for mounts, choosing the right RPC protocol is an important decision - fRPC also uses the same proto3 DSL, which makes it an easy drop-in replacement, and it also supports multiplexing and connection polling - Because of these similarities, the usage of fRPC in r3map is extremely similar to gRPC (code snippet from https://github.com/pojntfx/r3map/blob/main/pkg/services/seeder_frpc.go) Results Testing Environment - Hardware specifications - Benchmark scripts are reproducible (add link here) - Multiple runs were done for each measurement to ensure consistenly Access Methods - Latency for technologies - Compared to disk and memory, all three other access methods (userfaultfd, direct mounts, managed mounts) are slow - Latency of accessing a chunk on userfaultfd is 15x slower than on disk - Latency of accessing a chunk on direct mounts is 28x slower than on disk, almost double the latency of userfaultfd - Managed mounts is 40x slower than disk - It is however important to consider the total scale: All of these are in the scale of µs - Interesting to see that managed mounts have a significantly higher latency - Looking at the distribution, we can see a similar pattern - Spread for managed mount is the smallest, but there are significant outliers up until 1750 µs - Direct mounts has the highest spread of the access methods - Up until now we took a look at 0ms RTT, meaning that we connected the backend to the mount/userfault handler directly - If we take a look at how it behaves as the RTT changes, we get a different story - Both userfaultfd and direct mounts first chunk latency grows linearly - For managed mount, the latency gets slightly higher up to a RTT of 25ms, but even then max. 
1/2 the RTT of the latency that is expected compared to the direct mount of userfaultfd - After 25ms, the first chunk latency is the same or lower as it was for 0 RTT - Similar pattern if workers are tuned for managed mounts - For zero workers, latency grows almost linearly with the RTT - If more than 0 workers are used, the latency keeps growing almost linearly, until it peaks at 10ms and then drops down to the pre-0ms value - Throughput for technologies - For throughput of the memory, disk, userfaultfd, direct mounts and managed mounts access methods, we can see similar trends but some distinctions - Memory has the highest throughput at over 20 GB/s - This however is not followed by the disk as it was in latency, but rather by the direct mounts (at 2.8GB/s) and managed mounts (2.4GBs) - The disk is at 2.3GB/s, while userfaultfd falls behind at 450 MB/s - If we only look at userfaultfd, managed mounts and direct mounts this different between the access methods becomes more clear - When it comes to the throughput distribution, userfaultfd has the lowest spread, while managed mounts has the highest, closely followed by direct mounts - Interestingly, the median of direct mounts is especially high compared to the other access methods - As with latency, the results are very different if the RTT increases - Direct mounts and userfaultfd drop to below 10MB/s and 1MB/s respectively after the RTT reaches 6ms and continue to drop as the RTT increases - Managed mounts also drop, but much less drastically, even at a RTT of 25ms the throughput is still at over 500MB/s - If the RTT is lower than 10ms, almost 1GB/s can still be achieved with managed mounts - Similar results to what was measured with latency can be seen with different pull worker counts for the managed mount as RTT increases - Low worker counts have good performance at 0ms RTT - As RTT increases, the higher the pull worker count, the higher the throughput continues to be - For 16384 workers, throughput can be kept at consistently over 1GB/s even at 30ms latency - Write throughput for technologies - So far we’ve looked at read throughput - For a migration/remote access usecase that mostly makes sense - But for completeness, data for write throughput is also available - Note that the file is opened with O_DIRECT for this benchmark, causing additional overhead when writing, but also - for benchmark purposes - don’t require msync - As RTT increases, managed mounts have much better write performance - Write speed for direct mounts drop down to below 1 MB/s after 4ms, while they are consistenly above 230 MB/s for managed mounts, independent of RTT Initialization - Mount initialization time - There are two diffent ways to check for whether a device is ready: Polling or subscribing to the udev events - Initialization time for udev is higher on average than the polling method - Spread is similar for both methods - Pre-emptive pulls - When looking at pre-emptive pulls as RTT increases, the role of workers becomes apparent - The higher the worker count is, the more data can be pulled - While for 4096 workers, almost 40 MB can be pulled pre-emptively at 7ms, this drops to 20 MB for 2048 workers, 5 MB for 512 workers and continues to drop - Even for a RTT of 0ms, more workers mean more pre-emptively pulled data in general Chunking - Server vs. 
client-side chunking - Chunking can be done on either client- or server-side for both direct and managed mount - It is clear that unless the RTT is 0, managed mounts yield significantly higher throughput than direct mounts of both server- and client side chunking - When looking at the direct mounts, server-side chunking is very fast for 0ms RTT at almost 500 MB/s throughput, but drops very quickly to 75MB/s at 1ms, 40MB/s at 2ms, and continues to drop down to just over 5MB/s at 20ms RTT - For client-side chunking, the result is lower at just 30MB/s even at a 0ms RTT, and then continues to drop steadily it reaches 4.5MB/s at 20ms RTT - Looking at managed mounts the scenario is very different - Throughput also declines as RTT increases, but less drastically - Server-side chunking has much higher throughputs, at 450MB/s at 0ms RTT vs. 230MB/s at 0ms for the client-side chunking - As RTT increases, both direct and managed backends drop following a similar pattern, dropping to 300 MB/s for managed mounts and 240MB/s for direct mounts at 20ms RTT RPC frameworks - Throughput for RPC frameworks - Looking at the performance for dudirekta, gRPC and fRPC for managed and direct mounts, quite drastic throughput differences can be seen as the RTT increases - While for 0 RTT, the direct mounts provide the best throughput in line with the measurements for the different access technologies, the throughput drops drastically as RTT increases compared to managed mounts - In general, dudirekta has much lower throughput than both gRPC and fRPC - When looking at direct mounts specifically, the sharp difference between dudirekta and gRPC/RPC is visible - While at 0 RTT, fRPC is at 390MB/s and gRPC is at 500MB/s, dudirekta reaches just 50MB/s - At 2ms, throughput for all RPC frameworks drop drastically as the RTT increases, with fRPC and gRPC both reaching 40MB/s, and Dudirekta dropping to just 20MB/s - All three RPC frameworks reach a throughput of just 7 MB/s at 14ms, and then continue to decrease until they reach 3 MB/s at 40ms - For managed mounts, there is a similarly stark difference between dudirekta and the other RPC frameworks - Dudirekta’s throughput stays consistently low, at an average of 45MB/s, interestingly performing better in the >20ms RTT section - Like in direct mounts, gRPC manages to outperform fRPC at 0ms RTT at 395MB/s vs. 250MB/s, but fRPC gets consistenly higher values starting at 2ms RTT - fRPC has a better throughput than gRPC on average, managing to keep above 300MB/s throughput until 25ms RTT, while gRPC drops to below this after 14ms RTT - Average throughput of fRPC is higher, but as RTT reaches 40 it becomes less drastic at a max. 
of 50MB/s difference after 28ms RTT is reached Backends - Latency for backends - Average first chunk latency for memory, file, directory, Redis and Cassandra backends at 0ms RTT are all similar and within µs range - The latency for accessing Redis is the lowest at 2.5µs - Looking at the latency distribution, Redis once again is visible as having both the smallest amount of spread and the lowest amount of latency - Memory and S3 stick out for having a low amount of outlyers when it comes to first chunk latency, while the directory backend is notable for it’s significant spread - Throughput for backends - When looking at throughput, the backends have significantly more different characteristics compared to latency - File and memory consistently have high throughput - For direct mounts, file throughput is higher than memory througput at 2081 MB/s vs 1630MB/s on average respectively - For managed mounts, this increases to 2372MB/s vs. 2178MB/s respectively - When comparing the direct vs. managed mount values, Cassandra and the remote backends in general show vast differences in throughput - Cassandra reaches almost 700MB/s for a managed mount, while it falls to 3MB/s when using a direct one - Similarly so, for Redis and the directory backend, direct mounts are roughly 3.5 times slower than the managed mounts, while for S3 it is 5.5 times slower - When looking at the throughput for the direct mounts with the option of having remote access, Redis has a much higher average throughput at 114MB/s compared to both Cassandra at 3MB/s and S3 at 8MB/s - For managed mounts, Cassandra performs better than both S3 and Redis, managing 689MB/s compared to 439MB/s for Redis and 44 MB/s for S3 - The throughput distribution for the different backends with direct mounts is similarly interesting - A logarithmic Y axis is used to show the kernel density estimation for Redis, which has a much wider spread compared to Redis and S3, but is also consistently higher than both even with this spread - For managed mounts, Cassandra has a high spread but also the highest throughput, while Redis has the lowest - S3 is also noticable here again for having a consistently lower througput compared to both alternatives, with an average spread - When looking at average throughput for direct mounts, all backends see their throughput drop drastically as RTT increases - Memory and file have vey high throughputs above 1.4GB/s at 0MS RTT, while Redis achieves just 140MB/s as the closest network-capable alternative - Directory noticably has a lower throughput than Redis, despite not being network-capable - All other technologies are at below 15MB/s for direct mounts, even at 0ms RTT, and all technologies towards below 3MB/s at 40ms for direct mounts - When looking at the network-capable backends in isolation, the difference between Redis and the others in direct mount performance is striking, but all trend towards low throughput as the RTT increases - For managed mounts, the memory and file backends again outperform all other options at over 2.5GB/s, while the closest network-capable technology reaches just 660MB/s - Similarly to the latency, all technologies trend towards a similar throughput as RTT increases, with sharp drops for the memory and file backends after 2ms - Noticably, both the directory and S3 backends underperform, even for managed mounts, dropping down to just 55MB/s at 40ms RTT - When looking at just the network-capable backends, it is clear again that S3 underperforms drastically - Both Redis and Cassandra start 
between 550-660MB/s at 0 RTT, and then begin to drop after 6ms RTT until the drop to 170-180MB/s at 40ms RTT, with Redis consistently being slightly above the throughput for Cassandra Discussion Userfaults - By using userfaultfd we are able to map almost any object into memory - This approach is very clean and has comparatively little overhead, but also has significant architecture-related problems that limit its uses - The first big problem is only being able to catch page faults - that means we can only ever respond the first time a chunk of memory gets accessed, all future requests will return the memory directly from RAM on the destination host - This prevents us from using this approach for remote resources that update over time - Also prevents us from using it for things that might have concurrent writers/shared resources, since there would be no way of updating the conflicting section - Essentially makes this system only usable for a read-only “mount” of a remote resource, not really synchronization - Only a viable solution for post-copy migration - Also prevents pulling chunks before they are being accessed without additional layers of indirection - The userfaultfd API socket is also synchronous, so each chunk needs to be sent one after the other, meaning that it is very vulnerable to long RTT values - Also means that the initial latency will be at minimum the RTT to the remote source, and (without caching) so will be each future request - Has faster acccess latency compared to direct mounts and managed mounts at 0 RTT - Latency grows with RTT linearly as with direct mounts because there are no background pulls and no pre-emptive pulls like for managed mounts - Significantly lower througput than direct mounts and managed mounts, esp. compared to managed mounts due to linear access - Has a lower spread than direct mounts and managed mounts, but even with the more predictable throughput, its still consistently slower than both - For high RTT, it becomes essentially unusable - In summary, while this approach is interesting and very idiomatic to Go, for most data, esp. larger datasets and in high-latency scenarios/in WAN, we need a better solution File-Based Synchronization - Similarly to userfaultfd, this system also has limitations - While userfaultfd was only able to catch reads, this system is only able to catch writes to the file - Only a viable solution for pre-copy migration - Essentially this system is write-only, and it is very inefficient to add hosts to the network later on - As a result, if there are many possible destinations to migrate state too, a star-based architecture with a central forwarding hub can be used - The static topology of this approach can be used to only ever require hashing on one of the destinations and the source instead of all of them - This way, we only need to push the changes to one component (the hub), instead of having to push them to each destination on their own - The hub simply forwards the messages to all the other destinations - Thanks to this support for a star-based archicture, file-based synchronization might be a good choice for highly throughput-constrained networks, where the potentially higher maximum downtime is acceptable FUSE - FUSE can provide both a solution for pre- and post-copy migration - FUSE also has downsides - It operates in user space, which means that it needs to do context switching, esp. 
compared to a file system in kernel space - Some advanced file system features aren’t available for a FUSE - The overhead of FUSE (and implementing a completely custom file system) for synchronizing memory is significant - The optimal solution would be to not expose a full file system to track changes, but rather a single file - As a result of this, the significant implementation overhead of such a file system led to it not being chosen, since NBD is available as an alternative Direct Mounts - Have a high spread/are unpredictable from a first chunk latency perspective at 0ms RTT, but are more predictable when it comes to throughput - First chunk latency grows linearly like userfaultfd as RTT increases, because there are no pre-emptive pulls - Have the highest throughput at 0ms RTT, even higher than managed mounts, because the duplicate I/O operations for the background pull are not necessary - Throughput doesn’t drop as rapidly as userfaultfd, but still significantly, because there are no background pulls and all data needs to be accessed one by one - Write speed is subpar because writes need to be delivered directly to the remote, as there are no background pushes compared to managed mounts - Are a good solution if the internal overhead of managed mounts is higher than the overhead of the RTT, which can be the case in LAN or other low-latency networks Managed Mounts - Have an internal overhead due to duplicate I/O operations for the background pull, resulting in worse throughput for low (esp. 0ms) RTT scenarios and higher initial latencies - As soon as the RTT reaches levels more typical for a WAN deployment, this overhead is balanced out - Tuning the background workers can substantially increase the throughput values, since data can be accessed in parallel - A pull priority function can allow for even more optimized pulls - Pre-emptive pulls can significantly reduce initial chunk latency, since multiple MB can be pulled before the device is open - Higher worker counts can significantly increase the amount of chunks that can be pre-emptively pulled - Generally speaking, polling is almost always the better choice for an initialization algorithm in order to reduce the overall Open() time and thus reduce the overhead of mounting a remote resource - Write throughput for managed mounts is significantly better than that of the direct mounts, since writes only go to the local backend, and are then asynchronously synced to the remote using the periodic background push system - Are a good solution for WAN deployments, as soon as the RTT increases to a point that is high enough for it to balance out the internal overhead of the duplicate I/O operations Chunking - In general, server-side chunking should always be preferred due to the much better throughput - For direct mounts, the throughput is low for both server- and client-side chunking due to the linear access as RTT increases, but server-side chunking is much more performant in this circumstance for low RTTs - For managed mounts, client-side chunking can halve the throughput of a mount compared to server-side chunking, mostly because if the chunks are smaller than the NBD block size, it decreases the amount of chunks that can be pulled relative to the worker count if the worker count stays the same, while server-side chunking doesn’t require an extra worker on the client for each extra chunk that needs to be pulled, making it possible for the background pull system to pull more, increasing the throughput RPC Frameworks - Dudirekta - Consistently has lower throughput than
RPC Frameworks - Dudirekta - Consistently has lower throughput than the alternatives - Performs better for managed mounts than for direct mounts thanks to its support for concurrent RPCs - Less sensitive to RTT compared to gRPC and fRPC for managed mounts - Even with managed mounts, much worse throughput compared to both due to no connection pooling support - Remains a good option for prototyping due to lower development overhead and transport layer independence - gRPC - Considerably faster than Dudirekta for managed and direct mounts - Has support for connection pooling, giving it a significant performance benefit over Dudirekta for managed mounts - Very good performance for direct mounts with 0ms RTT - Is an industry standard, and has good tooling and known scalability characteristics - fRPC - Has consistently higher throughput than gRPC in managed mounts due to internal optimizations - Is faster than Dudirekta in both direct mounts and managed mounts due to support for connection pooling - Is much more minimal than gRPC/Protocol Buffers, making it more maintainable - Has less enterprise testing and tooling available, making it a more performant but also less proven solution Backends - Redis - Has the smallest spread and the lowest initial chunk latency at 0ms RTT - Direct mounts have a lower throughput compared to managed mounts, showing good support for concurrent chunk access - Has the highest throughput for direct mounts by a significant margin due to its optimized protocol and fast key lookups - Is also the fastest network-capable backend for managed mounts, once again due to the good support for concurrent access - Good choice for ephemeral data like caches, where speed or the need for the direct mount API is important (an illustrative key-value chunk backend is sketched after this section) - Cassandra - Has high throughput in 0ms RTT deployments for managed mounts, showing very good concurrent access performance - Falls short when it comes to usage in direct mounts, where the performance is worse than any other backend, showing a high read latency overhead for looking up individual chunks - As RTT increases, has only slightly less throughput compared to Redis - Is a good choice if most data will be accessed by the managed mount's background pull system, but a bad choice if chunks will be accessed outside of this due to the low direct mount throughput - Is a good choice if the data should be persistent, since Cassandra can provide varying consistency guarantees, unlike Redis - S3 - Has the lowest performance of the network-capable backends that were implemented for managed mounts - Performance for managed mounts is consistently low even as RTT increases, presumably due to the high overhead of having to make HTTP requests to retrieve individual chunks - Does perform better than Cassandra for direct mounts - Is a good choice if S3 is the only option due to architectural constraints, or if the chance of chunks being read outside of the managed mount's background pull system is high, where Cassandra has worse throughput
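The following is an illustrative sketch (not r3map's actual backend implementation) of how a key-value store like Redis can serve fixed-size chunks, one key per chunk offset; the key naming, chunk size and use of github.com/redis/go-redis/v9 are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

const chunkSize = 4096 // hypothetical chunk size

type redisChunks struct {
	rdb *redis.Client
}

// ReadAt returns the chunk stored under its offset, or zeroes if it was never
// written, mirroring sparse block device semantics
func (r *redisChunks) ReadAt(p []byte, off int64) (int, error) {
	val, err := r.rdb.Get(context.Background(), fmt.Sprintf("chunk:%v", off)).Bytes()
	if err == redis.Nil {
		return copy(p, make([]byte, chunkSize)), nil
	} else if err != nil {
		return 0, err
	}

	return copy(p, val), nil
}

// WriteAt stores the chunk under its offset
func (r *redisChunks) WriteAt(p []byte, off int64) (int, error) {
	if err := r.rdb.Set(context.Background(), fmt.Sprintf("chunk:%v", off), p, 0).Err(); err != nil {
		return 0, err
	}

	return len(p), nil
}

func main() {
	backend := &redisChunks{
		rdb: redis.NewClient(&redis.Options{Addr: "localhost:6379"}),
	}

	// A single chunk round trip; the managed mount's pull workers would issue
	// many of these concurrently, which is where Redis' fast key lookups pay off
	if _, err := backend.WriteAt(make([]byte, chunkSize), 0); err != nil {
		panic(err)
	}

	buf := make([]byte, chunkSize)
	if _, err := backend.ReadAt(buf, 0); err != nil {
		panic(err)
	}
}
```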
NBD - Limitations of NBD and ublk as an alternative - NBD is the underlying technology for both direct and managed mounts - NBD is a battle-tested solution for this with fairly good performance - In the future, a leaner implementation called ublk could also be used - ublk uses io_uring, which means that it could potentially allow for much faster concurrent access - It is similar to NBD; it also uses a user space server to provide the block device backend, and a kernel ublk driver that creates /dev/ublkb* devices - Unlike the NBD kernel module, which uses a rather slow UNIX or TCP socket to communicate, ublk is able to use io_uring pass-through commands - The io_uring architecture promises lower latency and better throughput - Because it is however still experimental and docs are lacking, NBD was chosen Language Limitations - Issues with the r3map implementation - While the managed mounts API mostly works, there are some issues with it being implemented in Go - This is mostly due to deadlocking issues; if the GC tries to release memory, it has to stop the world - If the mmap API is used, it is possible that the GC tries to manage the underlying slice, or tries to release memory as data is being copied from the mount - Because the NBD server that provides the byte slice is also running in the same process that is consuming the byte slice, this causes a deadlock, as the server that provides the backend for the mount is also frozen (https://github.com/pojntfx/r3map/blob/main/pkg/mount/slice_managed.go#L70-L93) - A workaround for this is to lock the mmapped region into memory, but this will also cause all chunks to be fetched, which leads to a high Open() latency - This is fixable by simply starting the server in a separate process or another context where the GC does not cause it to stop, and then mmapping - Issues like this however are hard to fix, and point to Go potentially not being the correct language to use for this part of the system - In the future, using a language without a GC (such as Rust) could provide a good alternative - While the current API is Go-specific, such an implementation could also be exposed through a different interface to remain usable from Go Use Cases Remote Swap With ram-dl - ram-dl is a fun experiment - Is a tech demo for r3map - Uses the fRPC backend to expand local system memory - Allows mounting a remote system's RAM locally - Can be used to inspect a remote system's memory contents - Is based on the direct mount API - Uses mkswap, swapon and swapoff (code snippet from https://github.com/pojntfx/ram-dl/blob/main/cmd/ram-dl/main.go#L170-L190; an illustrative sketch of this step follows after this list) - Enables paging out to the block device provided by the direct mount API - ram-ul "uploads" RAM by exposing a memory-, file- or directory-backed file over fRPC - ram-dl then does all of the above - Not really intended for real-world use cases, but does show that this can be used for interesting use cases such as real, remote RAM/swap - Shows how easy it is to use the resulting library, as the entire project is just ~300 SLOC including backend configuration, flags and other boilerplate
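A minimal sketch of that swap setup step, assuming the direct mount exposes a block device such as /dev/nbd0 (the device path and error handling are illustrative and not taken from ram-dl):

```go
package main

import (
	"os/exec"

	"golang.org/x/sys/unix"
)

func main() {
	device := "/dev/nbd0" // hypothetical block device exposed by the direct mount

	// mkswap writes the swap signature to the device
	if output, err := exec.Command("mkswap", device).CombinedOutput(); err != nil {
		panic(string(output))
	}

	// swapon tells the kernel to start paging out to the device; from now on,
	// memory pressure results in reads from and writes to the mounted remote
	if err := unix.Swapon(device, 0); err != nil {
		panic(err)
	}

	// swapoff pages everything back in and stops using the device again
	defer func() {
		if err := unix.Swapoff(device); err != nil {
			panic(err)
		}
	}()
}
```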
Mapping Tape Into Memory With tapisk - Overview - tapisk is an interesting use case because of how close it is to STFS, which provided the inspiration for the FUSE-based approach - Very high read/write backend latency (multiple seconds, up to 90s, due to seeking) - Linear access, no random reads - Can benefit a lot from the asynchronous writes provided by managed mounts - Fast storage acts as a caching layer - The backend is linear, so only one read/write is possible at a time - With the local backend, writes are de-duplicated automatically and both reads and writes can be asynchronous/concurrent - Writes go to the fast ("local") backend first, the syncer then handles both directions - Only one concurrent reader/writer makes sense - Syncing intervals to/from the tape can maybe be minutes or more to make it more efficient (since long, connected write intervals prevent having to seek to a different position on the tape) - Implementation - Chunking works on tape drive records and blocks - A bbolt DB is used as an index - The index maps simulated offsets to real tape records (code snippet from https://github.com/pojntfx/tapisk/blob/main/pkg/index/bbolt.go#L43-L68; an illustrative sketch of such an index follows after this list) - When reading, the backend first looks up the real tape record for the requested offset in the index (code snippet from https://github.com/pojntfx/tapisk/blob/main/pkg/backend/tape.go#L71-L78) - Seeking can then use the accelerated MTSEEK ioctl to fast-forward to a record on the tape (code snippet from https://github.com/pojntfx/tapisk/blob/main/pkg/mtio/tape.go#L25-L40) - After seeking to the block, the chunk is read from the tape into memory - When writing, the drive seeks to the end of the tape (unless the last operation was a write already, in which case we're already at the end) (code snippet from https://github.com/pojntfx/tapisk/blob/main/pkg/mtio/tape.go#L57-L72) - After that, the current real tape block position is requested, stored in the index, and the chunk is written to the tape (code snippet from https://github.com/pojntfx/tapisk/blob/main/pkg/backend/tape.go#L87-L119) - Evaluation - tapisk is a unique use case that shows the versatility and flexibility of the chosen approach - E.g. the chunking system didn't have to be reimplemented (like with STFS) - we could just use the managed mount API directly without changes - This essentially makes the tape a chunked, reusable ReadWriterAt, in the same way as the directory backend - By using an RPC backend, a remote tape can be accessed in the same way, making it possible to map e.g. a remote library's tape robot to your local memory - Tape backends can also be used to access large amounts of linear data (terabytes) from a tape as though it were in memory - Makes it possible to access even large datasets or backups in a way that is comfortable and known to developers, compared to other tooling like tar - Can actually be a real use case to replace LTFS - If a file system (e.g. EXT4) is written to the tape, the tape can be mounted as a filesystem, too - LTFS is a kernel module filesystem for tape drives that exposes them as a file system, similar to STFS - But it is its own filesystem, while tapisk allows using any existing and tested filesystem on top of the generic block device - LTFS also doesn't support caching, making it hard to use for memory mapping, too - tapisk again shows how minimal the overhead of r3map is: While LTFS is tens of thousands of SLOC, tapisk achieves a similar feature set in just under 350 SLOC
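The following is an illustrative sketch (not tapisk's actual code) of such a bbolt-backed index mapping simulated block device offsets to real tape record numbers; the bucket name and key layout are assumptions:

```go
package main

import (
	"encoding/binary"
	"fmt"

	"go.etcd.io/bbolt"
)

var bucketName = []byte("offsets")

// tapeIndex persists the mapping from simulated offsets to real tape records
type tapeIndex struct {
	db *bbolt.DB
}

// SetRecord remembers which real tape record a simulated offset was written to
func (i *tapeIndex) SetRecord(offset, record uint64) error {
	return i.db.Update(func(tx *bbolt.Tx) error {
		k := make([]byte, 8)
		binary.BigEndian.PutUint64(k, offset)

		v := make([]byte, 8)
		binary.BigEndian.PutUint64(v, record)

		return tx.Bucket(bucketName).Put(k, v)
	})
}

// GetRecord resolves a simulated offset to the real tape record to seek to
func (i *tapeIndex) GetRecord(offset uint64) (record uint64, ok bool, err error) {
	err = i.db.View(func(tx *bbolt.Tx) error {
		k := make([]byte, 8)
		binary.BigEndian.PutUint64(k, offset)

		if v := tx.Bucket(bucketName).Get(k); v != nil {
			record = binary.BigEndian.Uint64(v)
			ok = true
		}

		return nil
	})

	return
}

func main() {
	db, err := bbolt.Open("index.db", 0o600, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Create the bucket before any lookups
	if err := db.Update(func(tx *bbolt.Tx) error {
		_, err := tx.CreateBucketIfNotExists(bucketName)

		return err
	}); err != nil {
		panic(err)
	}

	index := &tapeIndex{db: db}

	if err := index.SetRecord(4096, 17); err != nil { // offset 4096 lives in record 17
		panic(err)
	}

	record, ok, err := index.GetRecord(4096)
	fmt.Println(record, ok, err)
}
```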
Improving File System Synchronization Solutions - Existing Solutions - Another potential use case is using r3map to create a mountable remote filesystem with unique properties - Currently there are two choices for how these can be implemented - Google Drive, Nextcloud etc. listen to file changes on a folder and synchronize files when they change, similarly to the file synchronization approach to memory synchronization - The big drawback is that everything needs to be stored locally - If a lot of data is stored (e.g. terabytes), the locally available files would need to be manually selected - There is no way to dynamically download files this way as they are required - It is however very efficient, since the filesystem is completely transparent to the user (writes are synced back asynchronously) - It also supports an offline use case easily - The other option is to use a FUSE, e.g. s3-fuse - This means that files can be fetched on demand - But it comes with a heavy performance penalty - Writes are usually done (esp. for s3-fuse) when they are communicated to the FUSE by the kernel, which makes writes very slow - Offline usage is also hard/usually impossible - It is also usually not a fully featured system (e.g. no inotify support, missing permissions etc.) - This leaves the choice between two imperfect systems, each with its own downsides - Hybrid Approach - Using r3map can combine both approaches by both not having to download all files in advance and being able to write back changes asynchronously - It is also a fully-featured filesystem - Files can also be downloaded preemptively for offline access, just like with e.g. the Google Drive approach - This can be achieved by using the managed mount API - The block device is formatted using any valid filesystem (e.g. EXT4) and then mounted (a minimal sketch of this, combined with the database example below, follows after this section) - If files should be downloaded in the background, the amount of pull workers can be set to anything >0 - Reads of blocks/files that haven't been read yet can be resolved from the remote backend, giving a FUSE-like experience - Read blocks are then written to the local backend, making subsequent reads almost as fast as native disk reads (unlike in FUSE/like in the Nextcloud approach) - Writes go to the local backend and can then be synced up in periodic intervals like with the Nextcloud approach, making them much faster than a FUSE, too - Similarly to how the mount API is used, the migration API could then be used to move the file system between two hosts in a highly efficient way and without a long downtime - This essentially creates a system that bridges the gap between FUSE and file synchronization, once again showing how the memory synchronization tooling can be used to solve the synchronization of essentially any state, including disks Universal Database, Media and Asset Streaming - Streaming access to remote databases - Another use case is accessing a remote database locally - While using a database backend is one option for storing chunks, an actual database can also be stored in a mount - Particularly interesting for in-memory or on-disk databases like SQLite - Instead of having to download the entire SQLite database before using it, it can simply be mounted and accessed as it is being used - This allows very efficient network access, as only very rarely is the entire database needed - Since reads are cached with the managed mount API, only the first read should potentially have a performance impact - Similarly, writes to the database will have more or less the same throughput as writes to the local disk, since changes are written back asynchronously - If the full database should eventually be accessible locally, the background pullers can be used - If the location of e.g. indexes is known, then the pull heuristic can be specified to fetch these first to speed up initial queries - SQLite itself doesn't have to support a special interface or changes for this to work, since it can simply use the mount's block device as its backend, which shows how universally the concept is usable
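A minimal sketch of that setup, assuming the managed mount exposes a block device such as /dev/nbd0: format it with a regular filesystem, mount it, and point SQLite at a file that lives on it (the paths, device name and the use of mkfs.ext4 and github.com/mattn/go-sqlite3 are illustrative assumptions):

```go
package main

import (
	"database/sql"
	"os/exec"

	_ "github.com/mattn/go-sqlite3"
	"golang.org/x/sys/unix"
)

func main() {
	device := "/dev/nbd0"     // hypothetical device exposed by the managed mount
	mountpoint := "/mnt/data" // hypothetical mount point

	// Format the block device with a regular, well-tested filesystem
	if output, err := exec.Command("mkfs.ext4", device).CombinedOutput(); err != nil {
		panic(string(output))
	}

	// Mount it like any other disk; chunks are fetched from the remote backend
	// as the filesystem accesses them
	if err := unix.Mount(device, mountpoint, "ext4", 0, ""); err != nil {
		panic(err)
	}
	defer unix.Unmount(mountpoint, 0)

	// SQLite needs no changes: it simply sees a local file, while reads stream
	// in from the remote and writes are synced back asynchronously
	db, err := sql.Open("sqlite3", mountpoint+"/app.db")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	if _, err := db.Exec("CREATE TABLE IF NOT EXISTS todos (id INTEGER PRIMARY KEY, title TEXT)"); err != nil {
		panic(err)
	}
}
```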
- Making arbitrary file formats streamable - This concept doesn't just apply to SQLite however - Another use could for example be making formats streamable that usually aren't due to technical limitations - One such format is famously MP4 - Usually, if someone downloads an MP4, they can't start playing it before they have finished downloading - This is because MP4s store metadata at the end of the file (graphic with metadata) - The reason for storing this at the end is that generating the metadata requires encoding the video first - Usually, if the file is downloaded from start to finish, this metadata wouldn't be accessible, and simply inverting the download order wouldn't work since the stream would still be stored at different offsets of the file - While there are ways of changing this (for example by storing the metadata at the beginning after the entire MP4 has been transcoded), this is not possible for already existing files - With r3map however, the pull heuristic function can be used to immediately pre-fetch the metadata, and the individual chunks of the MP4 file are then fetched in the background or as they are being accessed (a hypothetical heuristic is sketched below) - This allows making a format such as MP4 streamable before it has been fully downloaded, or making almost any format streamable no matter the codec, without any changes required to the video/audio player or the format
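A minimal sketch of such a heuristic, assuming a hypothetical callback that receives a chunk offset and returns a pull priority (the actual signature of r3map's pull priority function may differ; all sizes are illustrative):

```go
package main

import "fmt"

const (
	chunkSize    = int64(64 * 1024)              // hypothetical chunk size
	fileSize     = int64(4 * 1024 * 1024 * 1024) // hypothetical MP4 size
	metadataSize = int64(8 * 1024 * 1024)        // assumed size of the trailing metadata
)

// pullPriority would be consulted by the background pull workers; chunks with
// higher values are pulled earlier
func pullPriority(offset int64) int64 {
	switch {
	case offset >= fileSize-metadataSize:
		return 2 // trailing metadata (e.g. the moov box) first, so playback can start

	case offset < 16*chunkSize:
		return 1 // then the beginning of the stream

	default:
		return 0 // everything else in natural order or on demand
	}
}

func main() {
	fmt.Println(pullPriority(0))                    // 1
	fmt.Println(pullPriority(fileSize - chunkSize)) // 2
	fmt.Println(pullPriority(fileSize / 2))         // 0
}
```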
- Streaming app and game assets - Another use case of this in-place streaming approach could be gaming - Usually, all assets of a game are currently downloaded before the game is playable - For many new AAA games, this can be hundreds of gigabytes - Many of these assets aren't required to start the game however - Only some of them (e.g. the launcher, UI libraries, first level etc.) are - In theory, one could design a game engine in such a way that these things are only fetched from a remote as they are required - This would however require changing the way new games work, and would not work for existing games - Playing games off e.g. a network drive or FUSE is also unrealistic due to performance limitations - Using the managed mount API, the performance limitations of such an approach can be reduced without changes to the game - By using the background pull system, reads from chunks that have already been pulled are almost as fast as native disk reads - By analyzing the data access pattern of an existing game, a pull heuristic function can be created which will then preemptively pull the game assets that are required first, keeping latency spikes as low as possible - Using the chunk availability hooks, the game could also be paused for a loading screen until a certain local availability percentage is reached to prevent latency spikes from missing assets, while still allowing for faster startup times - The same concept could also be applied to launching any application - For many languages, completely loading a binary or script into memory isn't required for it to start executing - Without significant changes to most interpreters and language VMs however, loading a file from a file system is the only available interface for loading programs - With the managed mount API, this interface can be re-used to add streaming execution support to any interpreter or VM, simply by using the managed mount either as a filesystem to stream multiple binaries/scripts/libraries from, or by using the block device as a file or mmapping it directly to fetch it Universal App State Synchronization and Migration - Modelling state - Synchronization of app state is hard - Even for hand-off scenarios, a custom protocol is built most of the time - It is sometimes possible to use a database (e.g. Firebase) to synchronize things - But that can't sync all data structures and requires using specific APIs to sync things - What if you could mount and/or migrate any resource? - Usually these structures are marshalled, sent over a network, received on a second system, unmarshalled, and only then are done being synced - This requires a complex sync protocol, and deciding when to sync, when to pull etc. is inefficient and usually happens through a third party (e.g. a database) - Data structures can almost always be represented by a []byte - If the data structures are allocated on a []byte backed by the block device, we can use the managed mount/direct mount/migration APIs to send/receive them or mount them (a minimal sketch of this follows below)
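A minimal sketch of this idea, assuming the mount exposes a block device such as /dev/nbd0: the app's state is a fixed-size, pointer-free struct placed directly on the mmapped device, so it can be mounted or migrated without any marshalling (the struct layout and device path are illustrative assumptions, and, as discussed under language limitations above, the NBD server backing the device should run in a separate process):

```go
package main

import (
	"fmt"
	"os"
	"unsafe"

	"golang.org/x/sys/unix"
)

// todoList must only contain fixed-size, pointer-free fields so that its byte
// representation is the state
type todoList struct {
	Count  int32
	Titles [128][64]byte
}

func main() {
	f, err := os.OpenFile("/dev/nbd0", os.O_RDWR, 0) // hypothetical mount device
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Map the start of the block device into memory
	b, err := unix.Mmap(int(f.Fd()), 0, int(unsafe.Sizeof(todoList{})), unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(b)

	// View the mapped bytes as the app's data structure; writes land in the
	// local backend and are synced to the remote asynchronously
	todos := (*todoList)(unsafe.Pointer(&b[0]))
	copy(todos.Titles[todos.Count][:], "Write thesis")
	todos.Count++

	fmt.Println("TODOs:", todos.Count)
}
```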
- Persisting state - This makes a few interesting use cases possible - For example, one could imagine a TODO app - By default, the app mounts the TODO list as a byte slice from a remote server - Since authentication is pluggable and e.g. a database backend with a prefix for the user can be used, this could scale fairly well - Using the pre-emptive background pulls, when the user connects, they could stream the byte slice in from the remote server as the app is accessing it, and also pre-emptively pull the rest - If the TODO list is modified by changing it in memory, these changes are automatically written to the local backend - The asynchronous push algorithm can then take these changes and sync them up to the remote backend when required - If the local backend is persistent, such a setup can even survive network outages - Migrating app state between hosts - Possibly more interesting than this smart mount, where the remote struct is completely transparent to the user, is the migration use case - If a user has the TODO app running on a phone, but wants to continue writing on their desktop, they can migrate the app's state with r3map - In this use case, the migration API could be used - The pre-copy phase could be started automatically if the phone comes physically close to the computer - Since this would be a LAN scenario, the migration would be very fast and allow for interesting hand-off use cases - The migration API also provides hooks and the protocol that ensure that the app is closed on the phone before it is resumed on the desktop, making state corruption during the migration much harder than with a third party to which changes are synced (where the remote state may not yet be up-to-date when the app is resumed) - Migrating an app's TODO list or any internal data structure is one option - Migrating virtual machines - Locking is not handled by this, only one user is supported - The in-memory representation would also need to be universal, which is not the case with Go and across different CPU architectures - Using a toolkit like Apache Arrow can provide a universal format for this memory, but many other things are possible if a layer of indirection is added - If e.g. Go is compiled to Wasm, and the Wasm VM's linear memory is pointed to the NBD device, this becomes possible - This would also allow storing the entire app state in the remote, with no changes to the app itself - As mentioned before, a Wasm VM's memory could also be migrated this way, allowing any Wasm app to be transferred between hosts - Similarly, the Wasm app's binary, a mounted WASI filesystem etc. could all be synchronized this way, too - Thanks to the preemptive pull mechanism, restoring e.g. a Wasm VM outside a migration would also be a very fast operation - r3map can also be used for actual VM migration, with any hypervisor that supports mapping a VM's memory to a block device - This could also be other VMs (like e.g. the JVM), a browser's JS VM, or - if the processor architectures etc. match - an entire process's memory space - Using the mount API, it would also be possible to bring the concept of VM snapshots to almost any app - Essentially, it's VM memory migration and snapshots, but for any kind of state - It's particularly interesting because it can do so without any specific requirements for the data structures of the state other than them ultimately being a []byte, which by definition every state is (at a process or VM level)
Summary - There are multiple ways and configurations for implementing a solution that allows for universally accessing, synchronizing and migrating memory regions efficiently - Each configuration and API shows different strengths and weaknesses in benchmarks - Access methods - userfaultfd is an interesting API that is very idiomatic to both Linux, as a first-party solution, and Go, due to its fairly low implementation overhead, but it does fall short in throughput, especially when used in WAN networks, where other options provide better performance - The delta synchronization method for mmapped files does provide a simple way of memory synchronization for specific scenarios, but has a very significant I/O and compute overhead, making it unsuitable for most applications - FUSE provides an extensive API to implement a complete file system in user space, but has significant implementation overhead, making it a suboptimal solution for memory synchronization - Direct mounts provide a good access method for LAN deployment scenarios, where networks have low latency and the lack of I/O overhead compared to other methods makes them a compelling choice - Managed mounts are the preferred access method for WAN environments due to their efficient use of background push and pull, making it possible for them to adapt to the high latencies typical for such deployments, despite having a slight internal I/O overhead compared to direct mounts - When it comes to RPC framework choice, in most production environments either fRPC or gRPC represent a high-performance offering, where fRPC can offer slightly better average throughput, while gRPC can offer better tooling and developer experience guarantees due to its maturity - Backends - For memory migration and most synchronization use cases, the file backend can provide a reliable, persistent way of storing a memory region without using up too much host memory - For memory access use cases, Redis has consistently strong throughput in both managed and direct mount scenarios due to its concurrency optimizations, especially in LAN deployments with low latencies and when ephemeral data is being accessed - Cassandra and ScyllaDB can provide a good backend choice for applications with requirements for strong concurrency guarantees if managed mounts are being used to access a remote resource - These different approaches show that it is possible to adapt the necessary semantics for accessing, synchronizing and migrating memory regions to many backends, wire protocols and APIs - For most real-world applications, the proposed direct mount, managed mount and migration APIs provide a fast and reliable way of achieving this universal method of accessing resources Conclusion - The proposed solution, consisting of the direct mount, managed mount and migration APIs implemented in the r3map library, presents a truly universal way of accessing, synchronizing and migrating memory regions between hosts - It is not only a concept, but a ready-to-use library with good enough throughput and latency for it to be used in real-world scenarios today - Real-world use cases of the solution show this readiness - ram-dl shows how easy the proposed API is to use, making it possible to share and mount a remote system's memory in under 300 source lines of code - tapisk shows that the API can be used to make almost any resource, including a tape drive with linear reads and writes, accessible with the proposed system - Aside from these implemented use cases, the underlying technology makes many other
configurations viable that were previously thought impossible - File synchronization - Streaming remote databases and arbitrary formats - Streaming remote game and application assets - While there are limitations with the proposed solution's underlying technologies, these do provide future research opportunities - For example, the use of Rust, a language without garbage collection, in order to prevent the potential deadlock issues, or exploring the use of emerging alternatives for creating block devices in user space such as ublk, could further improve on the implementation presented - The biggest promise of the solution proposed by the thesis however is achieved - it provides a usable, universal way to access, synchronize and migrate generic applications and virtual machines without requiring significant changes to apps or hypervisors respectively - It is also able to provide configurations for both LAN and WAN deployments, making it possible to use memory synchronization technology in completely different, dynamic environments than before - As a result, entirely new ways of thinking about application architecture and lifecycles become possible, which can enable applications to become much simpler to maintain and scale than they are today References Abbreviations API: Application Programming Interface I/O: Input/Output IO: Input/Output OS: Operating System CPU: Central Processing Unit RAM: Random Access Memory SSD: Solid State Drive HDD: Hard Disk Drive UUID: Universally Unique Identifier CRC32: Cyclic Redundancy Check 32-Bit LRU: Least Recently Used WAN: Wide Area Network LAN: Local Area Network TCP: Transmission Control Protocol UDP: User Datagram Protocol P2P: Peer-To-Peer NATs: Network Address Translators IPC: Inter-Process Communication RTT: Round-Trip Time SRP: SCSI RDMA Protocol GNU: GNU's Not Unix UNIX: UNIX Family of Operating Systems macOS: Apple Macintosh Operating System FreeBSD: Free Berkeley Software Distribution NBD: Network Block Device S3fs: S3 File System NVMe: Non-Volatile Memory Express LTFS: Linear Tape File System LTO: Linear Tape-Open EXT4: Fourth Extended Filesystem Btrfs: B-Tree File System C: C Programming Language Rust: Rust Programming Language Go: Go Programming Language C++: C++ Programming Language ARM: ARM RISC Computer Processor Architecture x86: x86 CISC Computer Processor Architecture RISC-V: RISC-V RISC Computer Processor Architecture LPDDR5: Low-Power Double Data Rate 5 HTTP: Hypertext Transfer Protocol HTTPS: HTTP Secure HTTP/2: HTTP Version 2 QUIC: Quick UDP Internet Connections WebRTC: Web Real-Time Communication Wasm: WebAssembly IETF: Internet Engineering Task Force OIDC: OpenID Connect AWS: Amazon Web Services CNCF: Cloud Native Computing Foundation S3: Simple Storage Service TLS: Transport Layer Security mTLS: Mutual TLS SSH: Secure Shell DoS: Denial of Service JSON: JavaScript Object Notation JSONL: JSON Lines SQL: Structured Query Language NoSQL: Not Only SQL Protobuf: Protocol Buffers IDL: Interface Definition Language DSL: Domain-Specific Language KV: Key-Value Syscalls: System Calls VM: Virtual Machine RPC: Remote Procedure Call REST: Representational State Transfer FUSE: File Systems in Userspace Structure - Abstract - Tables - Content - Graphics - Abbreviations - Introduction: Research question/goal - Technology: Current state of research and introductions to related topics (like a reference book) - Planning: Ideas on how to solve the research question/goal - Implementation: Description of how the ideas were implemented - Results: Which results
each of the methods yielded (i.e. benchmarks) - Discussion: How the individual results can be interpreted and what use cases there are for the methods - Summary: How the discussion can be interpreted as a whole - Conclusion: Answer to the research question and future outlook, future research questions Citations [1] A. S. Tanenbaum and A. S. Woodhull, “Operating systems: Design and implementation,” 3rd ed., Upper Saddle River, NJ 07458: Pearson Education, Inc. Pearson Prentice Hall, 2006, pp. 27–29. [2] M. Kerrisk, The linux programming interface: A linux and UNIX system programming handbook. No Starch Press, 2010. [3] D. DeVault, “A hare code generator for finding ioctl numbers.” 2022. Accessed: Jul. 28, 2023. [Online]. Available: https://drewdevault.com/2022/05/14/generating-ioctls.html [4] Kernel Development Community, “Quick start.” 2023. Accessed: Jul. 19, 2023. [Online]. Available: https://www.kernel.org/doc/html/next/rust/quick-start.html [5] R. Love, Linux kernel development, 3rd ed. Pearson Education, Inc., 2010, pp. 8, 343–344. [6] W. Mauerer, Professional linux kernel architecture. Indianapolis, IN: Wiley Publishing, Inc., 2008, pp. 2–4, 7–8, 474–487, 1026–1027. [7] W. R. Stevens, Advanced programming in the UNIX environment. Delhi: Addison Wesley Logman (Singapore) Pte Ltd., Indian Branch, 2000, pp. 19, 407–411. [8] K. A. Robbins and S. Robbins, Unix™ systems programming: Communication, concurrency, and threads. Prentice Hall PTR, 2003, p. 313. [9] W. R. Stevens, B. Fenner, and A. M. Rudoff, UNIX® network programming volume 1, third edition: The sockets networking API. Addison Wesley, 2003, pp. 507–531. [10] A. J. Smith, “Cache memories,” ACM Comput. Surv., vol. 14, no. 3, p. 474, Sep. 1982, doi: 10.1145/356887.356892. [11] H. A. Maruf and M. Chowdhury, “Memory disaggregation: Advances and open challenges.” 2023.Available: https://arxiv.org/abs/2305.03943 [12] J. Bonwick, “The slab allocator: An Object-Caching kernel,” Jun. 1994.Available: https://www.usenix.org/conference/usenix-summer-1994-technical-conference/slab-allocator-object-caching-kernel [13] M. Gorman, Understanding the linux virtual memory manager. Upper Saddle River, New Jersey 07458: Pearson Education, Inc. Publishing as Prentice Hall Professional Technical Reference, 2004, pp. 53–57, 179. [14] A. Silberschatz, P. B. Galvin, and G. Gagne, Operating system concepts, 10th ed. Hoboken, NJ: Wiley, 2018, pp. 425–426.Available: https://lccn.loc.gov/2017043464 [15] J. Choi, J. Kim, and H. Han, “Efficient memory mapped file I/O for In-Memory file systems,” Jul. 2017.Available: https://www.usenix.org/conference/hotstorage17/program/presentation/choi [16] M. Prokop, “Inotify: Efficient, real-time linux file system event monitoring,” Apr. 2010. https://www.infoq.com/articles/inotify-linux-file-system-event-monitoring/ (accessed Jul. 19, 2023). [17] A. S. Tanenbaum, Computer networks. Pearson Education, Inc. Publishing as Prentice Hall PTR, 2003, pp. 21, 23. [18] “Transmission Control Protocol.” RFC 793; J. Postel, Sep. 1981. doi: 10.17487/RFC0793. [19] “User Datagram Protocol.” RFC 768; J. Postel, Aug. 1980. doi: 10.17487/RFC0768. [20] E. Rescorla, “The Transport Layer Security (TLS) Protocol Version 1.3.” RFC 8446, Aug. 2018. doi: 10.17487/RFC8446. [21] A. Langley et al., “The QUIC transport protocol: Design and internet-scale deployment,” in Proceedings of the conference of the ACM special interest group on data communication, 2017, pp. 183–196. doi: 10.1145/3098822.3098842. [22] H. 
Xiao et al., “Towards web-based delta synchronization for cloud storage services,” in 16th USENIX conference on file and storage technologies (FAST 18), Feb. 2018, pp. 155–168.Available: https://www.usenix.org/conference/fast18/presentation/xiao [23] B. K. R. Vangoor, V. Tarasov, and E. Zadok, “To FUSE or not to FUSE: Performance of User-Space file systems,” in 15th USENIX conference on file and storage technologies (FAST 17), Feb. 2017, pp. 59–72.Available: https://www.usenix.org/conference/fast17/technical-sessions/presentation/vangoor [24] E. Blake, W. Verhelst, and other NBD maintainers, “The NBD protocol.” Apr. 2023. Accessed: Jul. 18, 2023. [Online]. Available: https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md [25] S. He, C. Hu, B. Shi, T. Wo, and B. Li, “Optimizing virtual machine live migration without shared storage in hybrid clouds,” in 2016 IEEE 18th international conference on high performance computing and communications; IEEE 14th international conference on smart city; IEEE 2nd international conference on data science and systems (HPCC/SmartCity/DSS), 2016, pp. 921–928. doi: 10.1109/HPCC-SmartCity-DSS.2016.0132. [26] A. Baruchi, E. Toshimi Midorikawa, and L. Matsumoto Sato, “Reducing virtual machine live migration overhead via workload analysis,” IEEE Latin America Transactions, vol. 13, no. 4, pp. 1178–1186, 2015, doi: 10.1109/TLA.2015.7106373. [27] T. Akidau, S. Chernyak, and R. Lax, Streaming systems. Sebastopol, CA: O’Reilly Media, Inc., 2018, pp. 23–25. [28] J. D. Peek, UNIX power tools. Sebastopol, CA; New York: O’Reilly Associates; Bantam Books, 1994, pp. 4–5. [29] A. A. A. Donovan and B. W. Kernighan, The go programming language. Addison-Wesley Professional, 2015, pp. 217–221, 255–227. [30] gRPC Contributors, “Introduction to gRPC.” 2023. Accessed: Jul. 18, 2023. [Online]. Available: https://grpc.io/docs/what-is-grpc/introduction/ [31] Loophole Labs, Inc, “fRPC: A modern RPC framework designed for high performance and stability.” 2022. Accessed: Jul. 23, 2023. [Online]. Available: https://frpc.io/getting-started/overview [32] S. Vij, A. Sørlie, F. Pojtinger, and J. Sun, “Frisbee-go: Bring-your-own protocol messaging framework designed for performance and stability.” 2022. Accessed: Jul. 28, 2023. [Online]. Available: https://github.com/loopholelabs/frisbee-go [33] S. Vij, A. Sørlie, and F. Pojtinger, “Polyglot: A high-performance serialization framework used for encoding and decoding arbitrary data structures across languages.” 2022. Accessed: Jul. 23, 2023. [Online]. Available: https://github.com/loopholelabs/polyglot [34] Redis Ltd, “Introduction to redis.” 2023. Accessed: Jul. 18, 2023. [Online]. Available: https://redis.io/docs/about/ [35] Redis Ltd, “Redis pub/sub.” 2023. Accessed: Jul. 18, 2023. [Online]. Available: https://redis.io/docs/interact/pubsub/ [36] Amazon Web Services, Inc, “What is amazon S3?” 2023. Accessed: Jul. 18, 2023. [Online]. Available: https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html [37] MinIO, Inc, “Core administration concepts.” 2023. Accessed: Jul. 18, 2023. [Online]. Available: https://min.io/docs/minio/kubernetes/upstream/administration/concepts.html [38] A. Lakshman and P. Malik, “Cassandra: A decentralized structured storage system,” SIGOPS Oper. Syst. Rev., vol. 44, no. 2, pp. 35–40, Apr. 2010, doi: 10.1145/1773912.1773922. [39] ScyllaDB, Inc., “ScyllaDB ring architecture - overview.” 2023. Accessed: Jul. 18, 2023. [Online]. 
Available: https://opensource.docs.scylladb.com/stable/architecture/ringarchitecture/index.html [40] P. Grabowski, J. Stasiewicz, and K. Baryla, “Apache cassandra 4.0 performance benchmark: Comparing cassandra 4.0, cassandra 3.11 and scylla open source 4.4,” ScyllaDB Inc, 2021.Available: https://www.scylladb.com/wp-content/uploads/wp-apache-cassandra-4-performance-benchmark-3.pdf [41] F. Pojtinger, “STFS: Simple Tape File System, a file system for tapes and tar files.” 2022. Accessed: Jul. 20, 2023. [Online]. Available: https://github.com/pojntfx/stfs [42] J. Waibel and F. Pojtinger, “sile-fystem: A generic FUSE implementation.” 2022. Accessed: Jul. 20, 2023. [Online]. Available: https://github.com/jakWai01/sile-fystem