Efficient Synchronization of Linux Memory Regions over a Network: A Comparative Study and Implementation (Notes)

A user-friendly approach to application-agnostic state synchronization

Felicitas Pojtinger (Stuttgart Media University)

2023-08-03

Abstract

1.1 Introduction

1.2 Technology

1.2.1 User Space and Kernel Space

1.2.2 The Linux Kernel

1.2.3 Linux Kernel Modules

1.2.4 UNIX Signals and Handlers

1.2.5 UNIX Sockets

1.2.6 Memory Hierarchy

1.2.7 Memory Management in Linux

1.2.8 Swap Space

1.2.9 Page Faults

1.2.10 mmap

1.2.11 inotify

1.2.12 Linux Kernel Disk and File Caching

1.2.13 LAN and WAN

1.2.14 TCP, UDP, TLS and QUIC

1.2.15 Delta Synchronization

1.2.16 File Systems In Userspace (FUSE)

1.2.17 Network Block Device (NBD)

1.2.18 Virtual Machine Live Migration

1.2.18.1 Pre-Copy

1.2.18.2 Post-Copy

1.2.18.3 Workload Analysis

1.2.19 Streams and Pipelines

1.2.20 Go

1.2.21 gRPC

1.2.22 fRPC and Polyglot

1.2.23 Redis

1.2.24 S3 and Minio

1.2.25 Cassandra and ScyllaDB

1.3 Planning

1.3.1 Pull-Based Synchronization With userfaultfd

1.3.2 Push-Based Synchronization With mmap and Hashing

1.3.3 Push-Pull Synchronization with FUSE

1.3.4 Mounts with NBD

1.3.5 Push-Pull Synchronization with Mounts

1.3.6 Pull-Based Synchronization with Migrations

1.4 Implementation

1.4.1 Userfaults in Go with userfaultfd

1.4.2 File-Based Synchronization

1.4.3 FUSE Implementation in Go

1.4.4 NBD with go-nbd

1.4.5 Mounts

1.4.6 Live Migration

1.4.7 Pluggable Encryption and Authentication

1.4.8 Optimizing Backends For High RTT

1.4.9 Using Remote Stores as Backends

1.4.10 Bi-Directional and Concurrent RPCs with Dudirekta

1.4.11 Connection Pooling with gRPC

1.4.12 Optimizing RPC Throughput and Latency with fRPC

1.5 Results

1.5.1 Testing Environment

1.5.2 Access Methods

1.5.3 Initialization

1.5.4 Chunking

1.5.5 RPC Frameworks

1.5.6 Backends

1.6 Discussion

1.6.1 Userfaults

1.6.2 File-Based Synchronization

1.6.3 FUSE

1.6.4 Direct Mounts

1.6.5 Managed Mounts

1.6.6 Chunking

1.6.7 RPC Frameworks

1.6.8 Backends

1.6.9 NBD

1.6.10 Language Limitations

1.6.11 Use Cases

1.6.11.1 Remote Swap With ram-dl

1.6.11.2 Mapping Tape Into Memory With tapisk

1.6.11.3 Improving File System Synchronization Solutions

1.6.11.4 Universal Database, Media and Asset Streaming

1.6.11.5 Universal App State Synchronization and Migration

1.7 Summary

1.8 Conclusion

1.9 References

1.9.1 Abbreviations

API: Application Programming Interface
I/O: Input/Output
IO: Input/Output
OS: Operating System
CPU: Central Processing Unit
RAM: Random Access Memory
SSD: Solid State Drive
HDD: Hard Disk Drive
UUID: Universally Unique Identifier
CRC32: Cyclic Redundancy Check 32-Bit
LRU: Least Recently Used
WAN: Wide Area Network
LAN: Local Area Network
TCP: Transmission Control Protocol
UDP: User Datagram Protocol
P2P: Peer-To-Peer
NATs: Network Address Translators
IPC: Inter-Process Communication
RTT: Round-Trip Time
SRP: SCSI RDMA Protocol
GNU: GNU’s Not Unix
UNIX: UNIX Family of Operating Systems
macOS: Apple Macintosh Operating System
FreeBSD: Free Berkeley Software Distribution
NBD: Network Block Device
S3fs: S3 File System
NVMe: Non-Volatile Memory Express
LTFS: Linear Tape File System
LTO: Linear Tape-Open
EXT4: Fourth Extended Filesystem
Btrfs: B-Tree File System
C: C Programming Language
Rust: Rust Programming Language
Go: Go Programming Language
C++: C++ Programming Language
ARM: ARM RISC Computer Processor Architecture
x86: x86 CISC Computer Processor Architecture
RISC-V: RISC-V RISC Computer Processor Architecture
LPDDR5: Low-Power Double Data Rate 5
HTTP: Hypertext Transfer Protocol
HTTPS: HTTP Secure
HTTP/2: HTTP Version 2
QUIC: Quick UDP Internet Connections
WebRTC: Web Real-Time Communication
Wasm: WebAssembly
IETF: Internet Engineering Task Force
OIDC: OpenID Connect
AWS: Amazon Web Services
CNCF: Cloud Native Computing Foundation
S3: Simple Storage Service
TLS: Transport Layer Security
mTLS: Mutual TLS
SSH: Secure Shell
DoS: Denial of Service
JSON: JavaScript Object Notation
JSONL: JSON Lines
SQL: Structured Query Language
NoSQL: Not Only SQL
Protobuf: Protocol Buffers
IDL: Interface Definition Language
DSL: Domain-Specific Language
KV: Key-Value
Syscalls: System Calls
VM: Virtual Machine
RPC: Remote Procedure Call
REST: Representational State Transfer
FUSE: File Systems in Userspace

1.9.2 Structure

Citations

[1]
A. S. Tanenbaum and A. S. Woodhull, “Operating systems: Design and implementation,” 3rd ed., Upper Saddle River, NJ 07458: Pearson Education, Inc. Pearson Prentice Hall, 2006, pp. 27–29.
[2]
M. Kerrisk, The Linux programming interface: A Linux and UNIX system programming handbook. No Starch Press, 2010.
[3]
D. DeVault, “A Hare code generator for finding ioctl numbers.” 2022. Accessed: Jul. 28, 2023. [Online]. Available: https://drewdevault.com/2022/05/14/generating-ioctls.html
[4]
Kernel Development Community, “Quick start.” 2023. Accessed: Jul. 19, 2023. [Online]. Available: https://www.kernel.org/doc/html/next/rust/quick-start.html
[5]
R. Love, Linux kernel development, 3rd ed. Pearson Education, Inc., 2010, pp. 8, 343–344.
[6]
W. Mauerer, Professional Linux kernel architecture. Indianapolis, IN: Wiley Publishing, Inc., 2008, pp. 2–4, 7–8, 474–487, 1026–1027.
[7]
W. R. Stevens, Advanced programming in the UNIX environment. Delhi: Addison Wesley Longman (Singapore) Pte Ltd., Indian Branch, 2000, pp. 19, 407–411.
[8]
K. A. Robbins and S. Robbins, Unix™ systems programming: Communication, concurrency, and threads. Prentice Hall PTR, 2003, p. 313.
[9]
W. R. Stevens, B. Fenner, and A. M. Rudoff, UNIX® network programming volume 1, third edition: The sockets networking API. Addison Wesley, 2003, pp. 507–531.
[10]
A. J. Smith, “Cache memories,” ACM Comput. Surv., vol. 14, no. 3, p. 474, Sep. 1982, doi: 10.1145/356887.356892.
[11]
H. A. Maruf and M. Chowdhury, “Memory disaggregation: Advances and open challenges.” 2023. Available: https://arxiv.org/abs/2305.03943
[12]
J. Bonwick, “The slab allocator: An Object-Caching kernel,” Jun. 1994. Available: https://www.usenix.org/conference/usenix-summer-1994-technical-conference/slab-allocator-object-caching-kernel
[13]
M. Gorman, Understanding the Linux virtual memory manager. Upper Saddle River, New Jersey 07458: Pearson Education, Inc. Publishing as Prentice Hall Professional Technical Reference, 2004, pp. 53–57, 179.
[14]
A. Silberschatz, P. B. Galvin, and G. Gagne, Operating system concepts, 10th ed. Hoboken, NJ: Wiley, 2018, pp. 425–426. Available: https://lccn.loc.gov/2017043464
[15]
J. Choi, J. Kim, and H. Han, “Efficient memory mapped file I/O for In-Memory file systems,” Jul. 2017. Available: https://www.usenix.org/conference/hotstorage17/program/presentation/choi
[16]
M. Prokop, “Inotify: Efficient, real-time Linux file system event monitoring.” Apr. 2010. Accessed: Jul. 19, 2023. [Online]. Available: https://www.infoq.com/articles/inotify-linux-file-system-event-monitoring/
[17]
A. S. Tanenbaum, Computer networks. Pearson Education, Inc. Publishing as Prentice Hall PTR, 2003, pp. 21, 23.
[18]
J. Postel, “Transmission Control Protocol.” RFC 793, Sep. 1981. doi: 10.17487/RFC0793.
[19]
J. Postel, “User Datagram Protocol.” RFC 768, Aug. 1980. doi: 10.17487/RFC0768.
[20]
E. Rescorla, “The Transport Layer Security (TLS) Protocol Version 1.3.” RFC 8446, Aug. 2018. doi: 10.17487/RFC8446.
[21]
A. Langley et al., “The QUIC transport protocol: Design and internet-scale deployment,” in Proceedings of the conference of the ACM special interest group on data communication, 2017, pp. 183–196. doi: 10.1145/3098822.3098842.
[22]
H. Xiao et al., “Towards web-based delta synchronization for cloud storage services,” in 16th USENIX conference on file and storage technologies (FAST 18), Feb. 2018, pp. 155–168. Available: https://www.usenix.org/conference/fast18/presentation/xiao
[23]
B. K. R. Vangoor, V. Tarasov, and E. Zadok, “To FUSE or not to FUSE: Performance of User-Space file systems,” in 15th USENIX conference on file and storage technologies (FAST 17), Feb. 2017, pp. 59–72. Available: https://www.usenix.org/conference/fast17/technical-sessions/presentation/vangoor
[24]
E. Blake, W. Verhelst, and other NBD maintainers, “The NBD protocol.” Apr. 2023. Accessed: Jul. 18, 2023. [Online]. Available: https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md
[25]
S. He, C. Hu, B. Shi, T. Wo, and B. Li, “Optimizing virtual machine live migration without shared storage in hybrid clouds,” in 2016 IEEE 18th international conference on high performance computing and communications; IEEE 14th international conference on smart city; IEEE 2nd international conference on data science and systems (HPCC/SmartCity/DSS), 2016, pp. 921–928. doi: 10.1109/HPCC-SmartCity-DSS.2016.0132.
[26]
A. Baruchi, E. Toshimi Midorikawa, and L. Matsumoto Sato, “Reducing virtual machine live migration overhead via workload analysis,” IEEE Latin America Transactions, vol. 13, no. 4, pp. 1178–1186, 2015, doi: 10.1109/TLA.2015.7106373.
[27]
T. Akidau, S. Chernyak, and R. Lax, Streaming systems. Sebastopol, CA: O’Reilly Media, Inc., 2018, pp. 23–25.
[28]
J. D. Peek, UNIX power tools. Sebastopol, CA; New York: O’Reilly Associates; Bantam Books, 1994, pp. 4–5.
[29]
A. A. A. Donovan and B. W. Kernighan, The Go programming language. Addison-Wesley Professional, 2015, pp. 217–221, 255–227.
[30]
gRPC Contributors, “Introduction to gRPC.” 2023. Accessed: Jul. 18, 2023. [Online]. Available: https://grpc.io/docs/what-is-grpc/introduction/
[31]
Loophole Labs, Inc, “fRPC: A modern RPC framework designed for high performance and stability.” 2022. Accessed: Jul. 23, 2023. [Online]. Available: https://frpc.io/getting-started/overview
[32]
S. Vij, A. Sørlie, F. Pojtinger, and J. Sun, “Frisbee-go: Bring-your-own protocol messaging framework designed for performance and stability.” 2022. Accessed: Jul. 28, 2023. [Online]. Available: https://github.com/loopholelabs/frisbee-go
[33]
S. Vij, A. Sørlie, and F. Pojtinger, “Polyglot: A high-performance serialization framework used for encoding and decoding arbitrary data structures across languages.” 2022. Accessed: Jul. 23, 2023. [Online]. Available: https://github.com/loopholelabs/polyglot
[34]
Redis Ltd, “Introduction to Redis.” 2023. Accessed: Jul. 18, 2023. [Online]. Available: https://redis.io/docs/about/
[35]
Redis Ltd, “Redis pub/sub.” 2023. Accessed: Jul. 18, 2023. [Online]. Available: https://redis.io/docs/interact/pubsub/
[36]
Amazon Web Services, Inc, “What is Amazon S3?” 2023. Accessed: Jul. 18, 2023. [Online]. Available: https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
[37]
MinIO, Inc, “Core administration concepts.” 2023. Accessed: Jul. 18, 2023. [Online]. Available: https://min.io/docs/minio/kubernetes/upstream/administration/concepts.html
[38]
A. Lakshman and P. Malik, “Cassandra: A decentralized structured storage system,” SIGOPS Oper. Syst. Rev., vol. 44, no. 2, pp. 35–40, Apr. 2010, doi: 10.1145/1773912.1773922.
[39]
ScyllaDB, Inc., “ScyllaDB ring architecture - overview.” 2023. Accessed: Jul. 18, 2023. [Online]. Available: https://opensource.docs.scylladb.com/stable/architecture/ringarchitecture/index.html
[40]
P. Grabowski, J. Stasiewicz, and K. Baryla, “Apache Cassandra 4.0 performance benchmark: Comparing Cassandra 4.0, Cassandra 3.11 and Scylla Open Source 4.4,” ScyllaDB Inc, 2021. Available: https://www.scylladb.com/wp-content/uploads/wp-apache-cassandra-4-performance-benchmark-3.pdf
[41]
F. Pojtinger, “STFS: Simple Tape File System, a file system for tapes and tar files.” 2022. Accessed: Jul. 20, 2023. [Online]. Available: https://github.com/pojntfx/stfs
[42]
J. Waibel and F. Pojtinger, “sile-fystem: A generic FUSE implementation.” 2022. Accessed: Jul. 20, 2023. [Online]. Available: https://github.com/jakWai01/sile-fystem