Uni Distributed Systems Condensed Summary Felicitas Pojtinger 2023-02-05 - 1 Meta - 1.1 Contributing - 1.2 License - 2 Introduction to Distributed Systems - 2.1 Laws and Terms - 2.1.1 Metcalfe’s Law - 2.1.2 Generalized Queuing Theory Terms (Henry Liu) - 2.1.3 Little’s Law - 2.1.4 Hejunka - 2.1.5 Amdahl’s Law - 2.2 Process and I/O Models - 2.2.1 Different Process Models - 2.2.2 Questions for Process Models - 2.2.3 Different I/O Models - 2.2.4 Questions for I/O Models - 3 Message Protocols - 3.1 Delivery Guarantees - 3.1.1 The Role of Delivery Guarantees - 3.1.2 Why is TCP not Enough? - 3.1.3 Different Levels of Timeouts - 3.1.4 Delivery Guarantees for RPCs - 3.2 Idempotency - 3.2.1 Overview - 3.2.2 Server State and Idempotency - 3.2.3 Implementing Delivery Guarantees for Idempotent Requests - 3.2.4 Repeating Non-Idempotent Operations - 3.3 Order - 3.3.1 Request Order in Multi-Point-Protocols - 3.3.2 Request Ordering with Multiple Nodes - 3.3.3 Implementing Causal Ordered Broadcasts - 3.3.4 Implementing Absolutely Ordered Casts - 3.3.5 Reliable Messaging - 3.3.6 Transaction Models - 4 Theoretical Foundations of Distributed Systems - 4.1 Foundational Concepts - 4.1.1 The Eight Fallacies of Distributed Computing - 4.1.2 Analyzing Latency - 4.1.3 Characteristics of Distributed Systems - 4.2 Consistency - 4.2.1 Liveness vs. Correctness - 4.2.2 Liveness and Correctness in Practice - 4.2.3 Timing Models - 4.2.4 The Fischer, Lynch and Patterson Result - 4.3 The CAP Theorem - 4.3.1 Overview - 4.3.2 Preconditions for the CAP Theorem - 4.3.3 Common Misconceptions of the CAP Theorem - 4.3.4 The Modern View of the CAP Theorem - 4.3.5 PACELC - 4.4 Failure - 4.4.1 Technical Failures - 4.4.2 Failure Types - 4.4.3 Failure Models - 4.4.4 Failures and Timeouts - 4.4.5 Failure Detectors - 4.5 Clocks - 4.5.1 Time in Distributed Systems - 4.5.2 Consistent vs. Inconsistent Cuts - 4.5.3 Event Clocks (Logical Clocks) - 4.5.4 Lamport Logical Clock - 4.5.5 Vector Clocks - 4.5.6 Physical Interval Time (TrueTime) - 4.5.7 Hybrid Clocks - 4.5.8 Ordering in Distributed Event Systems - 4.6 Consensus - 4.6.1 Overview - 4.6.2 Algorithms for Consensus - 4.6.3 Two-Phase Commit (2PC) - 4.6.4 Quorum-Based Consensus - 4.6.5 Raft - 4.7 Broadcasting - 4.7.1 Atomic Broadcast Conditions - 4.7.2 Atomic Broadcast Protocol - 4.7.3 Gossip Protocols - 4.7.4 DWAL - 4.7.5 Design Components of DWALs - 4.8 Replication - 4.8.1 Properties of Replication Models - 4.8.2 Single-Leader Replication - 4.8.3 Eventually Consistent Reads - 4.8.4 Multi-Master Replication - 4.8.5 Leaderless Quorum Replication - 4.8.6 Session Modes of Asynchronous Replication - 4.8.7 Global Modes of Replication - 5 Distributed Services and Algorithms I - 5.1 Types of Distributed Services - 5.1.1 What is a Distributed Service? - 5.1.2 Services and Instances - 5.1.3 Core Distributed Services - 5.2 Availability - 5.2.1 Typical Hardware Causes for Downtime - 5.2.2 Availability through Redundancy - 5.2.3 3 Copy Disaster Recover Solution - 5.3 Load Balancing - 5.3.1 Serial vs. Redundant Availability - 5.3.2 Global Server Load Balancing - 5.3.3 Failover with Virtual IPs - 5.3.4 Failover, Load Balancing and Session State - 5.3.5 P2P Load Balancing - 5.4 Caching - 5.4.1 CQRS (Command Query Responsibility Segregation) - 5.4.2 Requirements of Caching Services - 5.4.3 Handling Changing Machine Counts in Caching Services - 5.4.4 Consistent Hashing Algorithms - 5.4.5 Cache Patterns - 5.4.6 Cache Design Considerations - 5.5 Events - 5.5.1 Local vs. 
Distributed Events - 5.5.2 Asynchronous Event-Processing - 5.5.3 Features of Event-Driven Interaction - 5.5.4 Types of Message Oriented Middleware (MOMs) - 5.5.5 ZeroMQ - 5.6 Sharding - 5.6.1 Horizontal vs. Vertical Sharding - 5.6.2 Sharding Strategies - 5.6.3 Horizontal Sharding Functions - 5.6.4 Consequences of Sharding - 5.7 Relationships - 5.7.1 Overview - 5.7.2 Functional Requirements of Relationship Services - 5.7.3 Why Relationship Services Failed - 6 Distributed Services and Algorithms II - 6.1 Problems with Classic Concurrency - 6.1.1 Why Truth is Expensive - 6.1.2 Aspects of Classic Distributed Consistency - 6.2 Locking - 6.2.1 Locking Against Concurrent Access - 6.2.2 Optimistic Locking - 6.2.3 Serializability with Two-Phase Locking - 6.2.4 Deadlocks - 6.2.5 Distributed Deadlock Detection - 6.3 Transactions - 6.3.1 Classic ACID Definitions - 6.3.2 Transaction Properties and Mechanisms - 6.3.3 Serializability and Isolation - 6.3.4 Transaction API - 6.3.5 Components of Distributed Transactions - 6.3.6 Service Context - 6.3.7 Distributed Two-Phase Commit - 6.3.8 Failure Models of Distributed Transactions - 6.3.9 Special Problems of Distributed Transactions - 6.3.10 Transaction Types - 6.3.11 ANSI Transaction Levels - 6.4 NoSQL - 6.4.1 Forces behind NoSQL - 6.4.2 Scaling Problems of RDBMs - 6.4.3 NoSQL Design Patterns - 6.4.4 DynamoDB Design Principles - 6.5 Beyond Relaxed Consistency - 6.5.1 Overview - 6.5.2 CALM Principle - 6.5.3 CALM Operations - 6.5.4 CRDTs - 6.5.5 Bending the Problem - 6.5.6 Examples of CRDTs - 6.5.7 Distributed Coordination - 6.6 Distributed Coordination Services - 6.6.1 Zookeeper API - 6.6.2 Primary-Order Atomic Broadcast with Zab - 6.6.3 Consistency Requirements for ABCast (Reliable Ordered Atomic Broadcast) - 6.6.4 Primary Order - 6.6.5 HA Transactions - 7 Design of Distributed Systems - 7.1 Design Principles for Distributed Systems - 7.1.1 Overview - 7.1.2 Sharing Resources and Data - 7.1.3 Connection Pooling - 7.1.4 Horizontal Scaling/Parallelization - 7.1.5 Caching and Replication - 7.1.6 End-to-End Argument - 7.1.7 Design Methodology - 7.1.8 Uncomfortable Real-World Questions - 7.1.9 Best Practices for Designing Services - 7.2 Architecture Fields - 7.2.1 Overview - 7.2.2 Information Architecture (to analyze Caching) - 7.2.3 Distribution Architecture - 7.2.4 Service Access Layer - 7.2.5 Physical and Process Architecture - 7.2.6 Architecture Validation - 7.3 Fan-Out Architecture - 7.3.1 Overview - 7.3.2 Reliability Issues in Dependencies - 7.3.3 Fragments - 7.3.4 Latency Reduction & Tolerance - 7.3.5 Avoiding Getting Stuck - 7.4 Containing Failures - 7.4.1 Circuit Breakers - 7.4.2 Bulkheads - 7.4.3 Blast Reduction Meta Contributing These study materials are heavily based on Professor Kriha’s “Verteilte Systeme” lecture at HdM Stuttgart and prior work of fellow students. Found an error or have a suggestion? Please open an issue on GitHub (github.com/pojntfx/uni-distributedsystems-notes): [QR code to source repository] If you like the study materials, a GitHub star is always appreciated :) License [AGPL-3.0 license badge] Uni Distributed Systems Notes (c) 2023 Felicitas Pojtinger and contributors SPDX-License-Identifier: AGPL-3.0 Introduction to Distributed Systems Laws and Terms Metcalfe’s Law Metcalfe’s law is a principle that states that the value or utility of a network increases as the number of users in the network increases.
This is because the more people who are using the network, the more useful it becomes as a platform for communication, collaboration, and the exchange of information and resources. The adoption rate of a network also tends to increase in proportion to the utility provided by the network, which is why companies often give away software or other products for free in order to increase the size of their user base and the value of their network. Metcalfe’s law is often cited as a factor that can contribute to the emergence of scale-free, or power law, distributions in networks. This type of distribution is characterized by a few nodes (or users) with many connections, and many nodes with only a few connections. The existence of network effects, in which the value of a network increases with the number of users, can help to explain why we don’t see many Facebooks or Googles – it can be difficult for new networks to gain traction and achieve the same level of utility as established networks with a large user base. Generalized Queuing Theory Terms (Henry Liu) - Server/Node: A combination of a wait queue and a processing element - Initiator: The entity that initiates a service request - Wait time: The time a request or initiator spends waiting in line for service - Service time: The time it takes for the processing element to complete a request - Arrival rate: The rate at which requests arrive for service - Utilization: The percentage of time the processing element spends servicing requests, as opposed to being idle - Queue length: The total number of requests waiting and being serviced - Response time: The sum of the wait time and service time for a single visit to the processing element - Residence time: The total time spent by the processing element on a single transaction, if it is visited multiple times - Throughput: The rate at which requests are serviced, or how fast requests can be processed without long wait times. Little’s Law - Little’s Law states that in a stable system, the long-term average number of customers (L) is equal to the long-term average effective arrival rate (λ) multiplied by the average time a customer spends in the system (W). - This can be expressed algebraically as L = λW. - Little’s Law is used to analyze and understand the behavior of systems that involve waiting, such as queues or lines. It can help to predict the average number of customers in a system, as well as the average time they will spend waiting, given a certain arrival rate. Hejunka Hejunka is a Japanese term that refers to the practice of leveling the production process by smoothing out fluctuations in demand and task sizes. It is often used in lean manufacturing and just-in-time (JIT) production systems to improve the efficiency and flow of work through a system. The goal of Hejunka is to create a steady, predictable flow of work through the system by reducing variability in task sizes and demand. 
This can be achieved through a variety of methods, such as: - Setting limits on the number of tasks or requests that can be processed at any given time - Balancing the workload across different servers or processing elements - Prioritizing tasks based on their importance or impact on the overall system - Using techniques such as batching or grouping similar tasks together to reduce variability By leveling the production process and reducing variability in task sizes, Hejunka can help to improve the efficiency and flow of work through a system, and reduce the risk of bottlenecks or delays caused by large differences in task size. Amdahl’s Law According to Amdahl’s Law, the maximum improvement in overall system performance that can be achieved by improving a particular part of the system is limited by the fraction of time that the improved part of the system is used. In other words, if only a small portion of the system’s workload is affected by the improvement, the overall improvement in performance will also be small. $speedup = \frac{1}{(1-parallelfraction)+\frac{parallelfraction}{numberofprocessors}}$ For example, if a particular part of a system is improved so that it runs twice as fast, but that part is only used 10% of the time, the overall speedup is only about 5%, and even an infinitely fast improvement of that part could not gain more than roughly 11%. On the other hand, if the improved part of the system is used 50% of the time, doubling its speed yields an overall speedup of about 1.33x (roughly 33%), and the hard limit is 2x no matter how much faster that part becomes (see the short calculation sketch after the process-model questions below). Amdahl’s Law is often used to understand the potential benefits and limitations of optimizing or improving specific parts of a system. It can be a useful tool for determining how many resources should be invested in improving a particular part of the system, and for understanding the potential impact of those improvements on overall system performance. Process and I/O Models Different Process Models - Single Thread/Single Core: This type of process model involves a single thread of execution running on a single core. This can be efficient for certain types of workloads, but may not be able to take full advantage of multiple cores or processors. - Multi-Thread/Single Core: This type of process model involves multiple threads of execution running on a single core. This can allow the system to perform multiple tasks concurrently, but may not be able to fully utilize the processing power of multiple cores or processors. - Multi-Thread/Multi-Core: This type of process model involves multiple threads of execution running on multiple cores or processors. This can allow the system to fully utilize the processing power of multiple cores or processors, and can be more efficient for certain types of workloads. - Single Thread/Multi-Process: This type of process model involves a single thread of execution running within each of multiple processes. This can allow the system to take advantage of multiple cores or processors, but may be less efficient than other models for certain types of workloads. Questions for Process Models - Can it use available cores/CPUs? - What is the ideal number of threads? - How does it deal with delays/(b)locking? - How does it deal with slow requests/uploads? - Is there observable non-determinism aka race conditions? - Is locking/synchronization needed? - What is the overhead of context switches and memory?
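To make the Amdahl’s Law numbers above concrete, here is a small calculation sketch (not part of the lecture material; the function and variable names are illustrative) that evaluates the formula for the two examples given in the text:

```go
package main

import "fmt"

// amdahlSpeedup returns the overall speedup of a system in which a fraction
// improvedFraction of the work is accelerated by factor (e.g. the number of
// processors for a perfectly parallel part, or 2 for "twice as fast").
func amdahlSpeedup(improvedFraction, factor float64) float64 {
	return 1.0 / ((1.0 - improvedFraction) + improvedFraction/factor)
}

func main() {
	// Part used 10% of the time, made twice as fast: ~1.05x overall (~5%).
	fmt.Printf("10%% part, 2x faster: %.2fx\n", amdahlSpeedup(0.1, 2))

	// Upper bound for the 10% part (factor -> infinity): ~1.11x (~11%).
	fmt.Printf("10%% part, limit:     %.2fx\n", amdahlSpeedup(0.1, 1e9))

	// Part used 50% of the time, made twice as fast: ~1.33x overall (~33%).
	fmt.Printf("50%% part, 2x faster: %.2fx\n", amdahlSpeedup(0.5, 2))
}
```

Plugging in other fractions quickly shows why optimizing rarely used parts of a system pays off so little.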
Different I/O Models - Synchronous Blocking (Java before NIO/AIO): Prior to the introduction of the Java New I/O (NIO) and Asynchronous I/O (AIO) APIs, Java had a different model for handling input/output (I/O) operations. This model involved using threads to block and wait for I/O operations to complete, which could be inefficient and consume a lot of system resources. - Synchronous Non-Blocking (Polling pattern): The polling pattern is a way of handling I/O operations in which a central component periodically checks for the completion of I/O operations. This can be done by repeatedly calling a function that checks the status of the operation, or by using a timer to trigger the check at regular intervals. - Asynchronous Blocking (Reactor pattern): The Reactor pattern is a way of handling I/O operations in which a central component is notified when an I/O operation is completed, rather than periodically checking for its completion. This can be more efficient than the polling pattern, as it allows the system to respond to I/O operations as soon as they are completed, rather than waiting for a periodic check. - Asynchronous Non-Blocking (Proactor pattern): The Proactor pattern is similar to the Reactor pattern, but it is designed to handle high-concurrency environments where many I/O operations are occurring simultaneously. It uses a combination of asynchronous I/O and event-driven design to allow for efficient handling of multiple I/O operations at once. Questions for I/O Models - Can it deal with all kinds of input/output? - How are synchronous channels integrated? - How hard is programming? - Can it be combined with multi-cores? - Scalability through multi-processes? - Race conditions possible? Message Protocols Delivery Guarantees The Role of Delivery Guarantees Shop order: The scenario described involves an online shop in which orders are placed and processed. The goal is to ensure that orders are delivered correctly and efficiently, regardless of any potential issues that may arise. - TCP Communication properties: TCP (Transmission Control Protocol) is a networking protocol that is used to establish and maintain communication between devices over a network. It has several key properties that are relevant to the scenario described, including reliability, flow control, and congestion control. - At-least-once: The “at-least-once” delivery guarantee means that a message may be delivered more than once, but it will always be delivered at least once. This can be useful in situations where it is more important to ensure that a message is delivered, even if it may be duplicated, than it is to prevent duplicates from occurring. - At-most-once: The “at-most-once” delivery guarantee means that a message will be delivered at most once. This can be useful in situations where it is more important to prevent duplicates from occurring than it is to ensure that a message is always delivered. - Exactly once: The “exactly once” delivery guarantee means that a message will be delivered exactly once, with no duplicates. This can be more difficult to achieve than the other delivery guarantees, as it requires additional complexity and overhead to ensure that duplicates are prevented. - Message complexity: The number of messages sent refers to the total number of messages that are transmitted as part of the communication process. 
In the scenario described, the number of messages sent may affect the efficiency and reliability of the communication process, and may need to be taken into account when determining the appropriate delivery guarantee to use. Why is TCP not Enough? While TCP (Transmission Control Protocol) is a widely used networking protocol that provides a reliable communication channel between devices, it is not always sufficient on its own to ensure proper behavior in all situations. Here are some reasons why TCP may not be enough: - Lost messages retransmitted: TCP includes mechanisms for retransmitting lost messages, which can help to improve the reliability of communication. However, if messages are frequently lost or the network is particularly unreliable, the overhead of retransmitting lost messages may become a burden on the system. - Re-sequencing of out of order messages: TCP includes mechanisms for reordering out-of-order messages, which can help to ensure that messages are delivered in the correct order. However, if messages are frequently delivered out of order, this can be inefficient and may cause issues with the overall communication process. - Sender choke back (flow control): TCP includes flow control mechanisms that allow the sender to adjust its rate of transmission based on the capacity of the receiver. However, if the sender is sending messages too quickly, this can lead to congestion on the network and reduced performance. - No message boundary protection: TCP does not provide any protection for message boundaries, which means that messages may be broken up or combined during transmission. This can make it difficult to ensure that messages are delivered in their entirety and can cause issues with the overall communication process. - Timeout problem: TCP includes mechanisms for detecting and handling connection failures, but these mechanisms may not be sufficient in all situations. For example, if a connection is lost due to a timeout, it may take some time to detect and recover from the failure, which can lead to delays and disruptions in the communication process. To address these and other issues, it may be necessary to use additional protocols or techniques, such as message-oriented middleware or application-level protocols, to ensure proper behavior in case of connection failures and to provide additional features and guarantees for message delivery. Different Levels of Timeouts - Business-Process-Timeout: This timeout is set at the business process level and is used to ensure that a business process does not get stuck or take too long to complete. This timeout may be triggered if a particular task or operation within the process takes longer than expected to complete, or if the process as a whole takes too long to finish. - RPC-Timeout (order progress): This timeout is set at the level of remote procedure calls (RPCs) and is used to ensure that RPCs do not get stuck or take too long to complete. This timeout may be triggered if an RPC takes longer than expected to complete, or if the progress of an RPC is not being monitored properly. - TCP-Timeout (reliable channel): This timeout is set at the TCP (Transmission Control Protocol) level and is used to ensure that the reliable communication channel provided by TCP is functioning properly. This timeout may be triggered if a connection is lost or if the channel becomes congested or otherwise unstable. 
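The three timeout levels above can be sketched in code. The following is a minimal illustration (not from the lecture), assuming a hypothetical placeOrder RPC: a business-process deadline spans the whole order flow, each RPC attempt gets its own shorter timeout, and the TCP-level timeout sits below this inside the transport and is not shown.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// placeOrder stands in for a single RPC to the shop backend (hypothetical).
func placeOrder(ctx context.Context, orderID string) error {
	select {
	case <-time.After(2 * time.Second): // pretend the call takes 2s
		return nil
	case <-ctx.Done(): // RPC- or business-level timeout fired first
		return ctx.Err()
	}
}

func main() {
	// Business-process timeout: the whole order flow may take at most 10s.
	processCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// RPC timeout: each individual call may take at most 1s and is retried.
	for attempt := 1; attempt <= 3; attempt++ {
		rpcCtx, cancelRPC := context.WithTimeout(processCtx, 1*time.Second)
		err := placeOrder(rpcCtx, "order-42")
		cancelRPC()

		if err == nil {
			fmt.Println("order placed on attempt", attempt)
			return
		}
		if errors.Is(processCtx.Err(), context.DeadlineExceeded) {
			fmt.Println("business-process timeout: giving up")
			return
		}
		fmt.Println("RPC timeout, retrying:", err)
	}
	fmt.Println("all retries failed")
}
```

Note that retrying after an RPC timeout is exactly what turns the delivery guarantee into “at least once”, which is why the idempotency discussion below matters.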
Delivery Guarantees for RPCs - Best effort: The “best effort” delivery guarantee means that no specific guarantees are made about the delivery of requests or responses. This means that requests may be lost or responses may not be received, and there is no mechanism in place to ensure that this does not happen. - At least once: The “at least once” delivery guarantee means that a request may be delivered more than once, but it will always be delivered at least once. This can be useful in situations where it is more important to ensure that a request is delivered, even if it may be duplicated, than it is to prevent duplicates from occurring. - At most once: The “at most once” delivery guarantee means that a request will be delivered at most once. This can be useful in situations where it is more important to prevent duplicates from occurring than it is to ensure that a request is always delivered. - Once and only once/exactly once: The “once and only once” or “exactly once” delivery guarantee means that a request will be delivered exactly once, with no duplicates. This can be more difficult to achieve than the other delivery guarantees, as it requires additional complexity and overhead to ensure that duplicates are prevented. Idempotency Overview Idempotency is a property of operations or requests that ensures that they can be safely repeated without changing the result of the operation. In other words, if an operation is idempotent, it will have the same result whether it is performed once or multiple times. - The first request needs to be idempotent: In a sequence of requests, it is important that the first request is idempotent. This ensures that the first request can be safely repeated if it fails or is lost, without affecting the overall result of the operation. - The last request can be only best effort - Messages may be reordered Server State and Idempotency Idempotency is an important property to consider when designing operations or requests that may be repeated or delivered multiple times, as it can help to ensure that the operation or request is able to be safely repeated without affecting the overall result. Here are some additional considerations related to idempotency and server state: - No need to remember a request and its result: If an operation or request is idempotent, the server does not need to remember the request or its result. This can be useful in situations where the server’s storage is limited or unreliable, as it means that the server does not need to maintain a record of all previous requests and their results. - Server can lose its storage: If the server’s storage is lost or becomes unavailable, it should not affect the overall result of the operation or request, as long as the operation or request is idempotent. This can help to ensure that the operation or request is able to be safely repeated even if the server’s storage is lost. - Concurrent updates might be consistent without concurrency control: If an operation or request is idempotent, it is possible that concurrent updates to the same data may be consistent without the need for concurrency control mechanisms such as locks or transactions. This can help to improve the efficiency and performance of the system. Implementing Delivery Guarantees for Idempotent Requests - “At least once” implementation for idempotent requests: For idempotent requests, the “at least once” delivery guarantee can be implemented by simply sending an acknowledgement (ack) to the client after the request has been received. 
This approach does not require any updates to the server state, and is suitable for requests that do not have any critical side effects. - “At most once” implementation for nonidempotent requests: For nonidempotent requests, the “at most once” delivery guarantee can be implemented by storing a response on the server until the client confirms that it has been received. This approach requires the server to maintain state for each response, and may involve adding a request number to each request to help the server detect and discard duplicate requests. - “Exactly once” implementation: The “exactly once” delivery guarantee is not possible to achieve in asynchronous systems with network failures. However, it can be approximated using techniques such as two-phase commit and epoch numbers, which allow the client and server to coordinate their actions and ensure that they do not forget their decisions. This approach may involve maintaining an atomic log on both the client and server, and storing responses on the server until they are confirmed by the client Repeating Non-Idempotent Operations If an operation is not idempotent, it means that it cannot be safely repeated multiple times and is likely to have unintended side effects. In this case, there are several measures that can be taken to ensure reliable communication: - Use a message ID to filter for duplicate sends: By including a unique message ID in each request, the server can filter out duplicates and only execute the request once. - Keep a history list of request execution results on the server: If the reply to a request is lost, the server can retransmit the result from its history list. This helps to ensure that the client receives the correct result even if the initial reply was lost. - Lease resources on the server: In some cases, it may be necessary to keep state on the server in order to facilitate communication. For example, a client may “lease” resources on the server for a specific period of time. This can help to ensure that resources are used efficiently and released when they are no longer needed. It is important to carefully manage state on the server in order to avoid overloading the system with old replies. It may be necessary to set limits on how long to store old replies and when to discard them. Order Request Order in Multi-Point-Protocols - No request order from one sender: In a multi-point protocol, there is no guaranteed order for requests sent by a single sender. This means that requests may be received and processed in a different order than they were sent, and the sender should be prepared to handle this possibility. - No request order between different senders: In a multi-point protocol, there is no guaranteed order for requests sent by different senders. This means that requests from different senders may be received and processed in a different order than they were sent, and the senders should be prepared to handle this possibility. - No request order between independent requests of different senders: In a multi-point protocol, there is no guaranteed order for independent requests sent by different senders. This means that independent requests from different senders may be received and processed in a different order than they were sent, and the senders should be prepared to handle this possibility. Request Ordering with Multiple Nodes In a multi-node system, it may be necessary to use a reliable broadcast protocol to ensure that requests are processed in the desired order. 
Here are some examples of protocols that can be used for request ordering with multiple nodes: - Reliable Broadcast: Reliable broadcast is a protocol that ensures that a message is delivered to all nodes in the system, and that it is delivered in the same order to all nodes. This can help to ensure that requests are processed in the correct order, even if they are sent from different nodes or if there are delays or other issues with the network. - FIFO Cast: FIFO cast is a protocol that ensures that messages are delivered in the order in which they were sent, with the first message sent being the first one to be delivered. This can help to ensure that requests are processed in the correct order, even if they are sent from different nodes or if there are delays or other issues with the network. This can still lead to causal inconsistencies! - Causal Cast: Causal cast is a protocol that ensures that messages are delivered in a causally consistent order, based on the dependencies between the messages. This can help to ensure that requests are processed in the correct order, even if they are sent from different nodes or if there are delays or other issues with the network. - Absolutely Ordered Casts: Absolutely ordered casts is a protocol that ensures that messages are delivered in a totally ordered sequence, with no uncertainty about the order in which the messages were sent. This can help to ensure that requests are processed in the correct order, even if they are sent from different nodes or if there are delays or other issues with the network. Implementing Causal Ordered Broadcasts - Piggybacking previous messages: One solution for implementing causal ordered broadcasts is to piggyback every message sent with the previous messages. This means that when a message is sent, it is accompanied by the previous messages that it depends on. This can help to ensure that processes that may have missed a message can learn about it with the next incoming message and then deliver it correctly. - Sending event history with every message: Another solution for implementing causal ordered broadcasts is to send the event history with every message. This can be done using techniques such as vector clocks, which are used to track the dependencies between events in a distributed system. With this approach, messages are not delivered until the order is correct. This can help to ensure that messages are delivered in the correct order, even if there are delays or other issues with the network. Implementing Absolutely Ordered Casts - All nodes send messages to every other node: One solution for implementing atomic broadcasts is for all nodes to send their messages to every other node in the system. This ensures that all nodes have a complete set of messages, which can be used to determine the total order of the messages. - All nodes receive messages, but wait with delivery: After receiving all of the messages, all nodes can wait with delivery until the total order of the messages has been determined. - One node is selected to organize the total order: To determine the total order of the messages, one node can be selected to organize the messages into a total order. This node can use a variety of techniques, such as vector clocks or Lamport timestamps, to determine the order of the messages. - The node sends the total order to all nodes: Once the total order has been determined, the node can send the total order to all other nodes in the system. 
- All nodes receive the total order and deliver their messages: Finally, all nodes can receive the total order and deliver their messages according to the determined order. There are however disadvantages with these implementations: - May have high overhead: This solution may have high overhead, as it requires all nodes to send and receive messages from every other node in the system. This can be particularly problematic in large systems with many nodes, as it may result in many messages being transmitted and processed. - May have high latency: This solution may also have high latency, as it requires all nodes to wait for the total order to be determined before delivering their messages. This can be particularly problematic in systems where low latency is critical. Reliable Messaging Reliable B2B (Business-to-Business) messages require the following qualities: - Guaranteed delivery (acknowledgement enforced) - Duplicate removal (using message ID) - Message ordering (using sequence numbers) SOAP and HTTP partially achieve this like so: 1. The first application layer exchanges persistent messages with the requester. 2. The requester sends a SOAP message with a message ID, sequence number, and QoS (Quality of Service) tag to the receiver. 3. The receiver must send an acknowledgement. 4. The receiver exchanges persistent messages with the second application layer. Transaction Models Atomic transactions: - Are not nested (standalone) - Are short - Involve a tightly coupled business task - Can be rolled back in case of error - Can be disrupted by system crashes Activity transactions: - Involve nested tasks - Are long-running - Involve a loosely coupled business activity - Include compensating tasks and activities to address errors - Can be disrupted by errors such as order cancellations Theoretical Foundations of Distributed Systems Foundational Concepts The Eight Fallacies of Distributed Computing - The network is reliable: This fallacy assumes that the network will always be available and free of errors or failures, which is not always the case. - Latency is zero: This fallacy assumes that communication between nodes in a distributed system will be instantaneous, but in reality, there is always some latency due to the time it takes for a request to be processed and a response to be received. - Bandwidth is infinite: This fallacy assumes that there is unlimited bandwidth available for communication between nodes, but in reality, bandwidth can be limited by factors such as network congestion or hardware limitations. - The network is secure: This fallacy assumes that the network is completely secure and free from threats such as hackers or malicious software, but this is not always the case. - Topology doesn’t change: This fallacy assumes that the topology of the network, or the way that nodes are connected, will remain constant, but in reality, the topology can change due to factors such as node failures or changes in network configuration. - There is one administrator: This fallacy assumes that there is only one person or entity responsible for managing the network, but in distributed systems, there may be multiple administrators or stakeholders with different roles and responsibilities. - Transport cost is zero: This fallacy assumes that there are no costs associated with communication between nodes, but in reality, there may be costs such as network fees or hardware expenses. 
- The network is homogeneous: This fallacy assumes that all nodes in the network are identical and operate in the same way, but in reality, nodes can have different hardware, software, and configurations. Analyzing Latency - Know the long-term trends in hardware: Latency can be influenced by the performance of the hardware being used, so it’s important to be aware of trends in hardware development and how they may impact latency. - Understand the problem of deep queuing networks and the solutions: Deep queuing occurs when there are many requests waiting to be processed, leading to longer latency. Understanding this problem and implementing solutions such as load balancing can help reduce latency. - Know your numbers with respect to switching times, router delays, round-trip times, IOPS per device, and perform “back of the envelope” calculations: It’s important to have a good understanding of the specific numbers and metrics related to latency in your system, such as switching times and router delays, and to perform calculations to estimate the impact of these factors on latency. - Understand buffering effects on latency: Buffering, or the temporary storage of data, can impact latency by adding additional processing time. Understanding the effects of buffering on latency can help you optimize your system to minimize this impact. - Include the client side in your calculations: Latency is often impacted by factors on the client side, such as the client’s hardware and network connection. It’s important to consider these factors when calculating latency and optimizing your system to minimize it. Characteristics of Distributed Systems - High complexity: Distributed systems involve a large number of interacting agents, such as servers, clients, and network devices, which can make them complex to design, build, and maintain. - Partial knowledge: In distributed systems, each node or agent typically has only partial knowledge about the state of the system, the actions of other nodes, and the current time. This can make it difficult to coordinate actions and ensure consistency across the system. - Uncertainty: Distributed systems are prone to uncertainty due to factors such as node failures, network delays, and changes in the system’s environment. This uncertainty can make it challenging to predict the behavior of the system and ensure its reliability. Consistency Liveness vs. Correctness Correctness and liveness are two important properties of distributed systems that ensure they function as intended and make progress. - Correctness refers to the property that ensures that a system will not exhibit undesirable behaviors, such as incorrect results or incorrect behavior. It can be thought of as the absence of bad things happening in the system. - Liveness, on the other hand, refers to the property that ensures that a system will eventually make progress and achieve its intended goals. It can be thought of as the presence of good things happening in the system. Both correctness and liveness are based on assumptions about failures and other conditions in the system, such as fairness and the presence of Byzantine errors. Ensuring that a distributed system has both correctness and liveness is critical for its success. Liveness and Correctness in Practice Here is an example of how correctness and liveness can be defined for an event-based system: Correctness: - Receive notifications only if subscribed to them: This ensures that a node only receives notifications for events it is interested in. 
- Received notifications must have been published before: This ensures that notifications are not received before they have been published, which would lead to incorrect behavior. - Receive a notification only at most once: This ensures that a node does not receive the same notification multiple times, which could lead to incorrect behavior. Liveness: - Start receiving notifications some time after a subscription was made: This ensures that a node will eventually start receiving notifications after it has subscribed to them. Failure Assumptions: Fail-stop model with fairness In this example, the system is designed to ensure correctness by limiting the notifications a node receives to those it is subscribed to and ensuring that notifications are received only once. It is designed to ensure liveness by ensuring that a node will eventually start receiving notifications after subscribing. The system also makes assumptions about failures, such as the fail-stop model with fairness, which are used to ensure the correctness and liveness of the system. Timing Models In distributed systems, timing models refer to the way that events and communication between nodes are synchronized. There are three main types of timing models: synchronous, asynchronous, and partially synchronous. - Synchronous timing models: In synchronous timing models, transmit times are strictly defined and events happen at specific moments. Nodes in a synchronous system can immediately detect when another node has crashed because the system relies on a clock to synchronize events. Examples of synchronous systems include CPUs and other types of hardware. - Asynchronous timing models: In asynchronous timing models, there is no exact time between sending and receiving messages. Messages will “eventually” arrive, but there is no guarantee about when. Because there are no strict timing constraints, a node in an asynchronous system cannot tell whether another node has crashed or is simply very slow to respond. There are no timeouts in asynchronous systems because they would require a clock. - Partially synchronous timing models: Partially synchronous timing models are a combination of synchronous and asynchronous models. These systems are asynchronous, but they are enhanced with local clocks that provide some level of synchronization. This is the model that is typically used for real-world distributed systems, as it allows for the flexibility of asynchronous communication while still providing some guarantees about timing. The Fischer, Lynch and Paterson Result - The FLP (Fischer, Lynch, and Paterson) result is a theoretical result that shows that in an asynchronous distributed system, no deterministic consensus algorithm can be guaranteed to terminate if even a single process may crash. - The result has significant implications for distributed algorithms that rely on consensus, such as leader election, agreement, replication, locking, and atomic broadcast. - The problem is caused by the need for a unique leader to make a decision, but the asynchronous nature of the system can lead to delays and the re-election of new leaders, which can indefinitely delay the decision-making process. - This problem affects most consensus-based distributed algorithms and can result in non-terminating runs where no decision is reached. The CAP Theorem Overview The CAP theorem states that in the presence of network partitions, a client must choose either consistency or availability, but not both. - Choosing consistency: If the client chooses consistency, they may not get an answer at all.
- Choosing availability: If the client chooses availability, they may receive a potentially incorrect answer. Preconditions for the CAP Theorem - To be considered consistent, the system must ensure that all operations are atomic and linearizable, meaning that they can be thought of as occurring at a single instant in time and have a total order. - For a system to be considered available, it must ensure that every request received by a non-failing node is met with a response, and that every request terminates. - Partition tolerance: The network will be allowed to lose arbitrarily many messages sent from one node to another. Common Misconceptions of the CAP Theorem - Consistency: Many systems do not achieve a total order of requests due to the costs (latency, partial results) involved. - Availability: Even an isolated node with a working quorum on the other side must answer requests, breaking consistency. The node does not know that a quorum exists. - Partition Tolerance: You cannot un-choose partition tolerance, as it is always present. CA systems are therefore not possible. The Modern View of the CAP Theorem - There are more failure types than just partition tolerance, such as host-crash and client-server disconnect. These failures cannot be completely avoided. - Many systems do not need linearizability, and it is important to carefully consider the type of consistency that is needed. - Most systems prioritize latency over consistency, with availability coming in second. - A fully consistent system in an asynchronous network is impossible (in the sense of FLP). FLP is much stronger than CAP. - The architecture of the system (replication, sharding) and the abilities of the client (failover) also have an impact on the system’s behavior. PACELC PACELC: A more complete portrayal of the space of potential consistency tradeoffs for distributed database systems. - In the presence of a partition (P), the system must trade off availability and consistency (A and C). - In the absence of a partition (E), the system can trade off latency (L) and consistency (C) when running normally. Failure Technical Failures - Network failures, such as partitioning, which can occur when a network is divided into smaller, separate networks that are unable to communicate with each other. - CPU/Hardware failures, such as instruction failures or RAM failures, which can occur when the hardware components of a system malfunction or fail. - Operating system failures, such as crashes or reduced function, which can occur when the operating system experiences an error or malfunction. - Application failures, such as crashes, stopped states, or partially functioning states, which can occur when an application experiences an error or malfunction. Unfortunately, in most cases there is no failure detection service that can identify when these failures occur and allow others to take appropriate action. However, it is possible to develop a failure detection service that could detect even partitioning and other types of failures through the use of MIBs (Management Information Bases) and triangulation. Applications could also be designed to track themselves and restart if necessary. Failure Types - Bohr-Bug: - Shows up consistently and can be reproduced. Easy to recognize and fix. - Heisenbug: - Shows up intermittently, depending on the order of execution. - High degree of non-determinism and context dependency. - Due to complex IT environments, they are both more frequent and harder to solve. 
- They are only symptoms of a deeper problem. - Changes to software may make them disappear temporarily, but more changes can cause them to reappear. - Example: Deadlock “solving” through delays instead of resource order management. Failure Models - Crash-stop: A process crashes atomically and stays down. - Crash-stop with recovery: A process crashes and is down until it begins recovery, and is up again until the next crash occurs. For consensus, at least 2f+1 machines are needed (quorum). - Crash-amnesia: A process crashes and restarts without recollection of previous events or data. - Failstop: A machine fails completely, and the failure is reported to other machines reliably. - Omission errors: Processes fail to send or receive messages even though they are alive. - Byzantine errors: Machines or parts of machines, networks, or applications fail in unpredictable ways and may recover partially. For consensus, at least 3f+1 machines are needed. Many protocols for achieving consistency and availability make assumptions about failure models. For example, transaction protocols may assume recovery behavior by its participants if the protocol terminates. Failures and Timeouts - Timeouts are not a reliable way to detect failures in distributed systems because they can be caused by various factors, such as short interruptions on the network, overload conditions, and routing changes. - Timeouts cannot distinguish between different types and locations of failures. - Timeouts cannot be used in protocols that require failstop behavior of its participants. - Most distributed systems only offer timeouts for applications to notice problems, so they do not provide detailed information about the state of participants or membership. - Using timeouts can result in “split-brain” conditions, where a system behaves as if it is functioning properly but is actually experiencing a failure or malfunction. Failure Detectors A failure detector (FD) is a mechanism used in distributed systems to detect failures of processes or machines. It is not required to be correct all the time, but it should provide the following quality of service (QoS): - Safety: The FD should be safe all the time, meaning it should not falsely suspect processes of being faulty during “better” failure periods. - Liveness: The FD should be live during “better” failure periods, meaning no process should block forever waiting for a message from a dead coordinator. - Accuracy: Eventually, some process x should not be falsely suspected of being faulty. When x becomes the coordinator, every process should receive x’s estimate and make a decision based on it. - Low overhead: The FD should not cause a lot of overhead, meaning it should not consume too many resources or slow down the system. Clocks Time in Distributed Systems In distributed systems, there is no global time that is shared across all processes. Instead, different approaches are used to model time in these systems. These approaches include: - Event clock time: This is a logical model of time that represents the order of events within a single process. - Vector clock time: This is a logical model of time that represents the order of events between multiple processes. - TrueTime: This is a physical model of time that represents the interval of time between events. - Augmented time: This is a combination of physical and logical models of time that takes into account both the interval of time between events and the order of events. 
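The “vector clock time” entry above can be made concrete with a minimal sketch (illustrative only, not the lecture’s implementation): each process increments its own slot for local events, merges incoming timestamps element-wise, and one timestamp is causally ordered after another exactly when it dominates it.

```go
package main

import "fmt"

// VectorClock maps a process ID to the number of events seen from it.
type VectorClock map[string]int

// Tick records a local event on process p.
func (vc VectorClock) Tick(p string) { vc[p]++ }

// Merge folds a received timestamp into the local clock (element-wise max),
// then counts the receive itself as a local event on process p.
func (vc VectorClock) Merge(p string, other VectorClock) {
	for id, n := range other {
		if n > vc[id] {
			vc[id] = n
		}
	}
	vc.Tick(p)
}

// Dominates reports whether a >= b in every position, i.e. b's history is
// contained in a's history.
func Dominates(a, b VectorClock) bool {
	for id, n := range b {
		if a[id] < n {
			return false
		}
	}
	return true
}

func main() {
	p1, p2 := VectorClock{}, VectorClock{}

	p1.Tick("p1")                        // event a on p1
	p2.Merge("p2", VectorClock{"p1": 1}) // p2 receives a message carrying p1's clock
	p2.Tick("p2")                        // later local event on p2

	fmt.Println("p2 dominates p1:", Dominates(p2, p1)) // true: p1's event is in p2's past
	fmt.Println("p1 dominates p2:", Dominates(p1, p2)) // false
}
```

If neither timestamp dominates the other, the two events are concurrent, which is exactly the situation used below to detect potentially conflicting updates.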
Logical time is modeled as partially ordered events within a process or between processes. It is used to represent the relative order of events in a distributed system, rather than the actual clock time at which they occurred. Causal meta-data in the system can also order the events properly. Consistent vs. Inconsistent Cuts - Consistent cuts: These cuts produce causally possible events, meaning that events occur in a logical and possible order. - Inconsistent cuts: These cuts produce events that arrive before they have been sent, resulting in an illogical or impossible order. Event Clocks (Logical Clocks) - Event clocks, also known as logical clocks, are systems for ordering events within processes according to a chosen causal model and granularity. - Events are partially ordered based on the order in which they occur. The time between events is a logical unit of time that has no physical extension. - Events delivered through messages can be used to relate the processes and their times to the events. The external order of these events is also a partial order of events between processes (for example, the event “send(p1,m)” occurs before the event “recv(p2,m)”). - The value of the logical clock is updated to the maximum of the current value plus one or the received value. Lamport Logical Clock - The Lamport logical clock counts events and creates an ordering relation between them. These counters can be used as timestamps on events. - The ordering relation captures all causally related events, but it also includes many unrelated (concurrent) events, which can create false dependencies. Vector Clocks - Vector clocks are transmitted with messages and compared at the receiving end. - If, for all positions in two vector clocks A and B, the values in A are larger than or the same as the values from B, we say that Vector Clock A dominates B. - This can be interpreted as potential causality to detect conflicts, as missed messages to order propagation, etc. Physical Interval Time (TrueTime) - TrueTime works by using time servers to check for rogue clocks, which are clocks that are not synchronized with the correct time. - The error in TrueTime is typically in the range of 6 milliseconds. Hybrid Clocks Hybrid clocks are systems that combine elements of both logical and physical models of time in order to address the limitations of each approach. There are several reasons why hybrid clocks may be used in distributed systems: - In large distributed systems, such as those spanning multiple data centers across the world, vector clocks can become too large to maintain efficiently. Hybrid clocks can be used to reduce the size of the clocks while still maintaining an accurate ordering of events. - Physical interval time, such as TrueTime, requires that reads and writes wait until the interval time is over on all machines. This can be inefficient in some cases, and hybrid clocks can be used to allow for more flexibility in terms of when reads and writes can occur. Ordering in Distributed Event Systems - FIFO (first-in, first-out) ordering: This refers to the requirement that a component must receive notifications in the order in which they were published by the publisher. - Causal ordering: This refers to the requirement that events must be ordered based on their causal relationships, as defined by the system. 
- Total ordering: This refers to the requirement that events must be ordered in a specific way, such that no other component in the system is allowed to receive an event before a preceding event has been received. One component may be responsible for deciding the global order of events in this case. Consensus Overview Consensus is a process used by a group of processes to reach agreement on a specific value, based on their individual inputs. The objective of consensus is for all processes to decide on a value v that is present in the input set. - Termination: Every correct process eventually decides on some value. - Validity: If a process decides on a value v, then v was proposed by some process. - Integrity: No process decides on a value more than once. - Agreement: No two correct processes decide on different values. Algorithms for Consensus These protocols offer trade-offs in terms of correctness, liveness (availability and progress), and performance: - Two-Phase Commit (2PC): This algorithm is used to ensure that a group of processes all commit to the same decision. It involves two phases: a prepare phase, in which the processes prepare to commit to a decision, and a commit phase, in which they actually commit to the decision. - Static Membership Quorum: This algorithm is based on the concept of quorum, which refers to a minimum number of processes that must be present in order to reach a decision. The static membership quorum algorithm is used to achieve consensus in systems with a fixed number of processes. - Paxos: This algorithm is used to achieve consensus in distributed systems with a dynamic membership. It involves multiple rounds of voting in order to reach a decision. - Raft: This algorithm is similar to Paxos, but it is designed to be easier to understand and implement. - Dynamic Group Membership: This class of algorithms is used to achieve consensus in systems with a dynamic membership. These algorithms include virtual synchrony and multicast-based approaches. - Gossip Protocols: These algorithms are used to disseminate information between processes in a distributed system. They can be used to achieve consensus by allowing processes to exchange information and reach agreement on a decision. Two-Phase Commit (2PC) Steps: 1. Preparation phase: In this phase, the processes prepare to commit to a decision. Each process sends a “prepare to commit” message to a coordinator process, which is responsible for coordinating the decision-making process. 2. Decision phase: Once all the processes have prepared to commit, the coordinator process sends a “commit” message to all the processes. This message instructs the processes to commit to the decision. 3. Confirmation phase: Each process sends a “commit confirmation” message to the coordinator process, indicating that it has successfully committed to the decision. 4. Finalization phase: Once all the processes have confirmed that they have committed to the decision, the coordinator process sends a “commit complete” message to all the processes, indicating that the decision has been successfully made. Example: 1. Imagine that there are three processes in a distributed system: A, B, and C. 2. The coordinator process is A. 3. The processes are deciding whether to commit to a new software update. 4. In the preparation phase, A sends a “prepare to commit” message to B and C. 5. B and C send “prepare to commit” messages back to A. 6. In the decision phase, A sends a “commit” message to B and C. 7. 
In the confirmation phase, B and C send “commit confirmation” messages back to A. 8. In the finalization phase, A sends a “commit complete” message to B and C, indicating that the decision to commit to the software update has been successfully made. Liveness and Correctness: - The two-phase commit protocol (2PC) allows for atomic (linearizable) updates to be made by participating processes. - 2PC is considered relatively expensive in terms of latency due to the requirement for a persistent log at each participant. - A crash failure of the coordinator or a participant can halt or disrupt the execution of 2PC. - If everything goes as planned, 2PC has a clear and easy-to-understand semantic for application developers. Quorum-Based Consensus Steps: 1. A decision is proposed by one of the processes in the distributed system. 2. Each process in the system votes on the proposed decision. 3. The votes are counted and checked against the quorum requirement. The quorum requirement is the minimum number of votes that must be received in favor of the decision in order for it to be approved. 4. If the quorum requirement has been met, the decision is considered to have been approved. 5. The processes move forward with implementing the decision. Example: 1. A group of five processes in a distributed system (A, B, C, D, and E) are deciding whether to commit to a new software update. 2. Process A proposes the decision to update the software. 3. Processes B, C, D, and E all vote on the proposed decision. 4. The votes are counted and checked against the quorum requirement, which is set at three votes. 5. Since a quorum of three votes has been received in favor of the decision, it is considered to have been approved. 6. The processes move forward with implementing the software update. Liveness and Correctness: - Quorum-based consensus protocols (QP) allow for atomic (linearizable) updates to be made by participating processes, but this depends on the type of implementation being used (e.g., whether read quorums are used). - QP are known to be relatively expensive in terms of latency, as even simple reads may require a quorum for consistency. To improve performance, many QP systems use a leader-based approach, which involves routing client requests through a designated leader process. - Leader crashes are usually handled through the use of leases, which can cause delays. - In the case of a crash failure, QP can continue as long as a quorum is still possible. - Network partitions may cause the system to either respond with all non-failing nodes (which may sacrifice consistency) or to stop responding to requests from minority nodes (which may sacrifice availability). - QP offer a reliable and efficient method for achieving consensus in distributed systems, but they also come with some trade-offs in terms of performance and fault tolerance. Raft Raft is a distributed consensus protocol that allows a group of processes (called “replicas”) to agree on a value (“decide”) in the presence of failures. Raft is divided into three distinct roles: Leader, Follower, and Candidate. The protocol consists of the following steps: 1. Leader Election: - When a replica starts up or its leader fails, it becomes a Candidate and initiates an election by sending RequestVote messages to all other replicas. - If a Follower receives a RequestVote message from a Candidate with a higher term, it responds with its vote and updates its term to match the Candidate’s term.
Raft Raft is a distributed consensus protocol that allows a group of processes (called "replicas") to agree on a value ("decide") in the presence of failures. Raft divides the replicas into three distinct roles: Leader, Follower, and Candidate. The protocol consists of the following steps: 1. Leader Election: - When a replica starts up or its leader fails, it becomes a Candidate and initiates an election by sending RequestVote messages to all other replicas. - If a Follower receives a RequestVote message from a Candidate with a higher term, it responds with its vote and updates its term to match the Candidate's term. - If a Candidate receives a quorum of votes (more than half of the replicas), it becomes the Leader and sends AppendEntries messages to all other replicas to replicate its log. 2. Log Replication: - The Leader sends AppendEntries messages to all other replicas to replicate its log. - If a Follower's log is missing the entry preceding the one in the Leader's message, it responds with a missing-entry error. - If the Follower's log is up-to-date and the Leader's entry is valid, the Follower appends the entry and responds with a success message. - If the Leader receives a success message from a quorum of replicas, it updates its commit index and sends a Commit message to all other replicas to apply the committed entry to their state machines. 3. State Machine Update: - If a replica receives a Commit message, it applies the committed entry to its state machine and responds with an Apply message to the Leader. - If the Leader receives an Apply message from a quorum of replicas, it updates its commit index and sends an Apply message to all other replicas. Broadcasting Atomic Broadcast Conditions A distributed algorithm that guarantees correct transmission of a message from a primary process to all other processes in a network or broadcast domain, including the primary. It satisfies the following conditions: - Validity: If a correct process broadcasts a message, then all correct processes will eventually deliver it - Uniform Agreement: If a process delivers a message, then all correct processes eventually deliver that message - Uniform Integrity: For any message m, every process delivers m at most once, and only if m was previously broadcast by the sender of m - Uniform Total Order: If processes p and q both deliver messages m and m', then p delivers m before m' if and only if q delivers m before m' It is widely used in distributed computing for group communication and defined as a reliable broadcast that satisfies total order. Atomic Broadcast Protocol Data: - Epoch (e): The duration of a specific leadership - View (v): Defined membership set that lasts until the membership changes (an existing member leaves or rejoins) - Transaction counter (tc): Counts rounds of execution, such as updates to replicas Phases: 1. Leader election/discovery: Members decide on a new leader and form a consistent view of the group. 2. Synchronization/recovery: Leader gathers outstanding, uncommitted requests recorded at members and updates members missing certain data until all share the same state. 3. Working: Leader proposes new transactions to the group, collects confirmations, and sends out commits. - Frequent leader changes can cause overhead and may be a potential denial of service. It is important to consider latency on the leader node when implementing an atomic broadcast protocol. - Paxos, Raft etc. are atomic broadcast protocols! Gossip Protocols Gossip protocols are a class of distributed algorithms that rely on randomly chosen pairs of nodes in a network to exchange information about the state of the system. They are typically used for group membership, failure detection, and dissemination of information. There are several key characteristics of gossip protocols: - Randomized: Gossip protocols rely on randomly chosen pairs of nodes to exchange information, which helps to reduce the risk of overloading any particular node. - Scalability: Gossip protocols scale well in large, distributed systems because they only require communication with a few nodes at a time.
- Fault tolerance: Gossip protocols are designed to tolerate failures and can continue to operate even if some nodes go down. - Asynchronous: Gossip protocols do not rely on a central authority or global clock, so they can operate asynchronously in a distributed system. DWAL A DWAL (Distributed Write-Ahead-Log) is a data structure that is used to ensure that updates to a distributed system are stored in a way that allows them to be recovered in case of system failure. It is a type of write-ahead log, which means that updates are written to the log before they are applied to the system's state. This allows the updates to be replayed in the correct order after a system failure. Design Components of DWALs - Global visibility: Replicated state should be visible to all processes in the system. This can be achieved through the use of atomic broadcast or other consensus protocols to ensure that all processes have a consistent view of the system state. - Consensus protocol: A consensus protocol such as Paxos or Raft is used to ensure that all processes agree on the order of updates to the replicated state. This ensures that all processes have a consistent view of the system state and reduces the risk of conflicts or data loss. - Majority decisions: In a consensus protocol-based system, majority decisions are used to ensure that the system can make progress even in the presence of failures. This means that a majority of processes must agree on the order of updates to the replicated state before they can be applied. - Group membership: In order to ensure that the DWAL can function properly, it is important to have a mechanism in place for maintaining an up-to-date view of the membership of the group of processes. - Message order and latency hiding: To ensure that the DWAL can function effectively, it is important to ensure that updates are delivered to all processes in a consistent order. Replication Properties of Replication Models - Who is responsible for updates: Single master or multiple masters? - What is being updated: State transfer or operation transfer? - How updates are ordered - Conflict detection and resolution - Method for updating replica nodes - Guarantees for divergence Single-Leader Replication Steps: 1. Client sends x=5 to Node 1 (master) 2. Master updates Node 2 and Node 3 (followers) 3. Client receives the changed value (or the old value, due to replication lag) Advantages: - Ordered updates - Efficient caching - Highly available reads Disadvantages: - Replicas may be out of sync with the master - Leader crash may cause problems - Followers may take a while to take over in case of leader failure - Not suitable for critical resources such as primary keys Eventually Consistent Reads Eventual consistency model: Allows for a certain level of lag before updates are propagated to all replicas Steps: 1. Client updates value on the master replica node 2. Master replica eventually propagates the update to the slave replica 3. Client performs a read from the slave replica before the update arrives, potentially returning an outdated (stale) value Multi-Master Replication Multi-Master Replication (MMR) is a type of replication in which multiple servers can accept write requests, allowing any server to act as a master. This means that updates can be made to any server, and the changes will be replicated to all other servers in the network. MMR can be used to improve the availability and scalability of a system, as it allows updates to be made to any server and allows multiple servers to handle write requests.
It also introduces the possibility of conflicts, as multiple servers may receive updates to the same data simultaneously. To resolve these conflicts, MMR systems typically use conflict resolution algorithms (last writer wins, keeping different versions, anti-entropy background merge/resolve) or allow the user to manually resolve conflicts. Steps: 1. Client 1 writes x=5 to Node 1 (master) 2. Client 2 writes x=10 to Node 2 (master) 3. The masters detect a conflict Conflict types: - Update conflict: The same row is changed on two servers at the same time - Uniqueness conflict: Two identical primary (unique) keys are added to the same table on two servers - Delete conflict: While a row is being deleted on one server, the same row is changed on a different server Leaderless Quorum Replication Write: - In a leaderless (quorum) replication system, the client decides how many machines to write to or read from using the formula W+R>N, where N is the number of machines in the replication group; choosing W and R such that W+R>N guarantees that every read quorum overlaps with every write quorum in at least one node. - Without a designated leader, quorum systems may suffer from long-tail effects. - If a quorum is not available, the client can choose to write to a "sloppy quorum" and risk the write being lost. - Without anti-entropy, there is a high risk of partial writes in the system, which can lead to inconsistencies and can be difficult to clean up. Read: - Some systems may detect inconsistencies during a read operation. - These systems can either automatically perform a cleanup (e.g. using version numbers to return the correct value) or offer both values for the client to choose from. Session Modes of Asynchronous Replication The following guarantees seem to enable "sequential consistency" for a specific client, meaning that the program order of updates from this client is respected by the system. Clients can track these guarantees using vector clocks: - "Read your writes" (RYW) ensures that the contents read from a replica include previous writes by the same user. - "Monotonic reads" (MR) ensures that successive reads by the same user return increasingly up-to-date contents. - "Writes follow reads" (WFR) ensures that a write operation is accepted only after writes observed by previous reads by the same user are incorporated in the same replica. - "Monotonic writes" (MW) ensures that a write operation is accepted only after all previous write operations made by the same user are incorporated in the same replica. We can also derive session anomalies from this: - Non-monotonic reads: Reads that do not return increasingly up-to-date contents. - Non-monotonic writes: Write operations that are not accepted in the order that they were made by the same user. - Non-monotonic transactions: Transactions that do not preserve the order in which they were made by the same user. - Not-reading-my-writes: Reads that do not include previous writes by the same user. Global Modes of Replication There are multiple different modes to choose from: - Strong consistency ensures that all previous writes are visible, and is characterized by the following properties: - Ordered: Writes are accepted in the order that they were made. - Real: All writes are visible. - Monotonic: Write operations are accepted only after all previous write operations made by the same user are incorporated in the same replica. - Complete: All writes are included. - Consistent prefix ensures that an ordered sequence of writes is visible, and is characterized by the following properties: - Ordered: Writes are accepted in the order that they were made. - Real: All writes are visible.
- X latest missing: Some of the latest writes may be missing. - Snapshot isolation-like: The system behaves like snapshot isolation, where writes made by a transaction are not visible to other transactions until the transaction is committed. - Bounded staleness ensures that all writes older than X or every write except the last Y are visible, and is characterized by the following properties: - Ordered: Writes are accepted in the order that they were made. - Real: All writes are visible. - X latest missing: Some of the latest writes may be missing. - Monotonic increasing due to bound: Write operations are accepted only after all previous write operations made by the same user are incorporated in the same replica, but the bound on staleness allows for some deviation from strict monotonicity. - Eventual consistency ensures that a subset of previous writes is visible, and is characterized by the following properties: - Unordered: Writes may not be accepted in the order that they were made. - Un-real: Some writes may not be visible. - Incomplete: Some writes may be missing. Each of these modes has its own trade-offs:

| Mode | Consistency | Performance | Availability |
| --- | --- | --- | --- |
| Strong consistency | Excellent | Poor | Poor |
| Eventual consistency | Poor | Excellent | Excellent |
| Consistent prefix | Okay | Good | Excellent |
| Bounded staleness | Good | Okay | Poor |
| Monotonic reads | Okay | Good | Good |
| Read my writes | Okay | Okay | Okay |

Distributed Services and Algorithms I Types of Distributed Services What is a Distributed Service? Function provided by a distributed middleware with: - High scalability - High availability Services and Instances - Distributed systems: - Comprised of services, such as applications, databases, caches, etc. - Services are made up of instances or nodes, which are individually addressable hosts (physical or virtual) - Key observation: - Unit of interaction is at the service level, not the instance level - Concerned with logical groups of nodes, not specific instances - Example: Interacting with a database server, rather than a specific database instance. Core Distributed Services - Finding Things - Name Service - Registry - Search - Storing Things - Various databases - Data grids - Block storage, etc. - Events Handling and asynchronous processing: Queues - Load Balancing and Failover - Caching Service - Locking Things and preventing concurrent access: Lock service - Request scheduling and control: Request multiplexing - Time handling - Providing atomic transactions: Consistency and persistence - Replicating Things: NoSQL DBs - Object handling: Lifecycle services for creation, destruction, relationship service, etc. Availability Is defined as: $Availability = \frac{Uptime_{agreed\ upon} - Downtime_{planned\ and\ unplanned}}{Uptime_{agreed\ upon}}$ Continuous availability does not allow for planned downtime.
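As a quick plug-in example for the formula above (the numbers are illustrative): with an agreed uptime of one year ($8760h$) and $8.76h$ of planned and unplanned downtime, $Availability = \frac{8760h - 8.76h}{8760h} \approx 0.999$, i.e. "three nines"; "five nines" (99.999 %) would leave only about five minutes of downtime per year.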
Typical Hardware Causes for Downtime - Overheating - PDU failure - Rack-move - Network rewiring - Rack failures - Racks go wonky - Network maintenances - Router reloads - Router failures - Individual machine failures - Hard drive failures Availability through Redundancy Across groups of resources: - Multi-site data center - Disaster recovery - Scalability Within a group of resources: - High availability (HA) - Clustering - Centralized administration (CA) - Automatic failover (CO) - Scalability - Data replication - Quorum algorithms (require multiple machines) Between two resources: - High availability (HA) - Centralized administration (CA) - Automatic failover (CO) - Load distribution to prevent overload in case of failure For an individual resource: - Single point of failure (SPOF) - Easy updates - Maintenance problems - Simple reliability - Limited vertical scalability 3 Copy Disaster Recovery Solution - Maintains 3 copies of data/resources with at least 2 in different locations - Enables quick switchover in case of disaster for business continuity - Provides high availability and protects against data loss. Load Balancing Serial vs. Redundant Availability - Serial chain of components: - Availability decreases with more members in the chain - Individual components need higher availability - Redundant, parallel components: - The unavailabilities of the components are multiplied and the product is subtracted from 1 to determine overall availability (e.g. two parallel components with 99% availability each yield 1 - 0.01 * 0.01 = 99.99%) - Only one component needs to be up to maintain availability. Global Server Load Balancing - DNS Round Robin: - Simple load balancing technique that distributes traffic to multiple servers based on the client's DNS query - Little mitigation in case of problems like overload, failure, etc. - Clients may disregard TTL settings - Takes approximately 15 minutes to drain traffic from troubled servers. - BGP Anycast: - Uses BGP routing to direct clients to the nearest available server - BGP does not consider link latency, throughput, packet loss, etc. in selecting the best route - With multiple routes to the destination, BGP simply selects the one with the least number of hops - Troubleshooting can be demanding. - Geo-DNS: - Uses the client's geographical location to determine the closest server for traffic distribution - Relies on the accuracy of the DNS provider's IP and location guesswork - TTL setting may not accurately reflect the time-to-live of cached information. - Real User Measurements (RUM): - Uses real-time data from end-user devices to dynamically adjust traffic distribution for optimal performance. Failover with Virtual IPs Failover with one virtual IP: - DNS points only to one Virtual IP (VIP) - In case of a server failure, client sessions are lost but they can establish a new session on reconnect - No changes in DNS are required, avoiding the potential issues of flushes and timeouts Multi-site failover: - A combination of geo-aware DNS and a Load Balancer/Fail-over front-server - Requests can be re-routed to different locations in case of a server failure - May still have the limitations and issues associated with geo-aware DNS. Failover, Load Balancing and Session State - Sticky Sessions: Keeps session state on a single server, offers advantages with a non-replicated system of records but limited in terms of fail-over and load-balancing options. - Session Storage in DB: Session state is stored in a database, offers better scalability and fail-over options compared to sticky sessions.
- Session Storage in Distributed Cache: Session state is stored in a distributed cache, provides better performance and scalability compared to database storage, but still with fail-over options. Today: Stateless servers with state in the DB are the norm, but sticky sessions are still useful when the system of records is not replicated. Compromise: Replicate sessions between pairs of servers, then enable switching between them as failovers P2P Load Balancing - Evaluator Functions: Access server stats in shared memory and determine the outcome of a request, whether it is handled by its own server, redirected, or proxied. - Server Stats: Various metrics such as CPU usage, number of requests, memory usage, etc., are replicated in shared memory and used by evaluator functions to make load-balancing decisions. - Server Stat Replication: The replication of server stats is done through multicast. Caching CQRS (Command Query Responsibility Segregation) - Separates the responsibilities of reading data (queries) and modifying data (commands) into separate objects or services. - Improves scalability and performance by allowing reads and writes to be optimized separately. - Promotes event-driven architecture by allowing commands to trigger domain events. - Simplifies domain modeling by reducing the complexity of aggregates. - Increases consistency by using separate models for reads and writes. - Reduces the coupling between the read and write sides of the system. Requirements of Caching Services - Scalable with ability to add machines - Avoid "thundering herds" due to placement changes - Supports replication of cache entries - High performance required - Optional disk backup support - Supports various storage mediums, from RAM to SSD - Supports different cache replacement policies with caution. Handling Changing Machine Counts in Caching Services - Problem: Changing Machine Count - Solution: Consistent Hashing (Ring) - Machines are mapped into a ring and their position determines the key-space they are responsible for. - Machines can be assigned multiple virtual positions. Consistent Hashing Algorithms Simple Consistent Hashing Algorithm: - URLs and caches are mapped to points on a circle using a standard hash function. - A URL is assigned to the closest cache in clockwise direction. - Adding a new cache only reassigns the closest URLs to it, items don't move between existing caches. (A minimal code sketch follows at the end of this caching section.) Dynamo Consistent Hashing Algorithm: - Separates placement and partitioning. - Uses virtual nodes assigned to real machines for more flexibility. - Each real node is responsible for multiple virtual nodes. - Improved load balancing due to the additional indirection. Cache Patterns Pull: - Occurs during request time - Concurrent misses and client crashes can result in outdated caches - Complicated handling of concurrent misses and updates - Can be slow and dangerous for backends Push: - Automated push updates cached values - Should only be used for values that are always needed Pre-warmed: - The system loads the cache before the application starts serving clients - Used for big applications with pull caches to avoid boot issues General consideration: Be aware of LRU or clocked invalidations as the cache is mission critical. Cache Design Considerations - Kinds of information fragments - Lifecycle of fragments - Validity of fragments - Effects of fragment invalidation - Dependencies between fragments, pages, etc.
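The simple ring scheme described above can be sketched in a few lines of Python. The class and cache names are illustrative, MD5 merely stands in for any standard hash function, and a production system would add virtual nodes as in the Dynamo variant.

```python
import hashlib
from bisect import bisect_right

def ring_position(value: str) -> int:
    """Map a cache name or key onto a point on the hash ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, caches):
        # Sorted ring positions of all caches; each cache owns the arc before its point.
        self.ring = sorted((ring_position(c), c) for c in caches)

    def cache_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the next cache on the ring."""
        pos = ring_position(key)
        idx = bisect_right([p for p, _ in self.ring], pos) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.cache_for("/news/article-42"))  # the same key always lands on the same cache
# Adding "cache-d" only re-maps the keys that now fall onto cache-d's arc.
```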
Events Local vs. Distributed Events Local: - Observer updates sent on one thread - If observer doesn't return, mechanism stops - If observer calls back to observed during update call, deadlock can occur - Solution doesn't scale and is not reliable (e.g. observer crashes result in lost registrations) - Does not work for remote communication Distributed: - Various combinations of push and pull models possible - Receivers can install filters using a constraint language to filter content (reduces unwanted notifications) Asynchronous Event-Processing - Publisher and subscriber communicate through interaction middleware - Used to decouple components and asynchronous sub-requests from synchronous main requests (so that multiple fast tasks can run parallel to a slow main task) - Implemented as Message-Oriented-Middleware (MOM) or socket-based communication library - Can be implemented in broker-less or brokered mode. Features of Event-Driven Interaction - Basic Event: Any entity can send and receive events without restrictions or filtering. - Subscription: A receiver can subscribe to specific events, making event delivery more efficient. - Advertisement: The sender informs receivers about possible events, reducing the need for broadcasting. - Content-Based Filtering: The sender, middleware, or receiver can apply filtering based on event content. - Scoping: Administrative components can manipulate event routes, enabling invisible communication between components. Types of Message Oriented Middleware (MOMs) Centralized Message-Oriented-Middleware: - Collects all notifications and subscriptions in one central place, enabling easy event matching and filtering - Has a high degree of control and no security/reliability issues on clients - Can create scalability and single-point-of-failure problems Clustered Message-Oriented-Middleware: - Provides scalability at higher communication costs - Has lots of routing/filter-tables at cluster nodes, making filtering and routing of notifications expensive Simple P2P Event Libraries: - Local libraries are aware of each other, but components are de-coupled - Broker-less architecture is faster than brokered ones - Does not provide at-most-once semantics or protection against message loss - Guarantees atomicity and possibly FIFO - Examples include ZeroMQ, Aeron, and Nanomsg Flooding Protocols: - Notifications travel towards subscriptions, which are only kept at leaf brokers - Advantages include that subscriptions become effective quickly, and notifications are guaranteed to arrive everywhere - Price is many unnecessary notifications to leaf nodes without subscribers ZeroMQ - Brokerless socket library for messaging, with message filtering - Connection patterns include pipeline, pub/sub, and multi-worker - Various transports, including in-process, across local processes, across machines, and multicast groups - Message-passing process model without the need for synchronization - Multi-platform and multi-language support - "Suicidal snail" fail-fast mechanism to kill slow subscribers Sharding Horizontal vs. Vertical Sharding Horizontal: Per (database) row, e.g. the first 100 users are in shard 1, the next 100 in shard 2, etc. Vertical: Per (database) column, e.g. profile and email are in shard 1, photos and messages in shard 2, etc.
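To make horizontal sharding by key concrete, here is a sketch of two simple shard functions (numeric range and hash-plus-modulo, both of which reappear under "Horizontal Sharding Functions" below); the shard sizes and counts are made-up example values.

```python
import hashlib

def range_shard(user_id: int, shard_size: int = 100) -> int:
    """Numeric-range sharding: users 1-100 -> shard 0, 101-200 -> shard 1, ..."""
    return (user_id - 1) // shard_size

def hash_shard(user_id: int, num_shards: int = 4) -> int:
    """Hash-and-modulo sharding: spreads keys evenly, but re-sharding moves most keys."""
    digest = hashlib.sha1(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

print(range_shard(150))  # 1 -> second shard
print(hash_shard(150))   # some shard in 0..3, stable for this key
```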
Sharding Strategies - Allow adding heterogeneous hardware in the future - Sharding should not make app code unstable - Sharding should be transparent to the app - Sharding and placement strategies should be separate Horizontal Sharding Functions Algorithms applied to the key (often: user ID) to create different groups (shards): - Numerical range: users 0-100000, 100001-200000, etc. - Time range: 1970-80, 81-90, 91-2000, etc. - Hash and modulo calculation - Directory-based mapping using a meta-data table for arbitrary mapping from key to shard Consequences of Sharding - No more SQL JOINs, leading to lots of copied data - Increased need for partial requests for data aggregation - Expensive distributed transactions required for consistency (if needed) - Vertical sharding distributes related data types of one user across shards, while horizontal sharding separates related users from each other (bad for social graph processing) - SQL limitations due to mostly key/value queries and problems with automatic DB sequences - Every change requires corresponding application changes Relationships Overview - Within the database, referential integrity rules protect containment relationships - No equivalent in object space - No protection in distributed systems For example, when an employee leaves: - All rights are cancelled - Disk space is archived and erased - Databases for authentication and application-specific DBs are updated - The badge no longer works - All equipment has to be returned Functional Requirements of Relationship Services - Definition of relations between objects without modifying those objects - Support for different types of relations - Ability to create graphs of relations - Ability to traverse relationship graphs - Support for reference and containment relations Why Relationship Services Failed The good: - Powerful modeling tool - Helps with creation, migration, copy, and deletion of composite objects - Maintains referential integrity The bad: - Tends to create many small server objects - Performance impact - Not supported by many CORBA vendors for a long time - EJB only supported relationships with local objects in the same container.
- Failure Models for 2PC: Models for how the 2PC protocol handles system failures and ensures the consistency of transactions. Locking Locking Against Concurrent Access Binary locks: - Used to synchronize an object, causing all clients except one to be blocked. - Limitations: Binary locks are simple to use, but their performance suffers as they cannot distinguish between reads and writes. Modal locks (read/write locks): - Used to allow clients who only want to read to obtain read locks. Many concurrent read locks are possible. - Advantages: Modal locks allow for a more nuanced approach to concurrent access, improving performance by allowing multiple read operations to occur simultaneously. Lock Granularity: The granularity of locks (the scope of the resources being protected by the lock) affects the overall throughput of a system. The smaller the lock granularity, the better the achievable concurrency and performance. Optimistic Locking Process: 1. Lock a row, read it along with its timestamp, and then release the lock. 2. Start a transaction. 3. Write the data to the database. 4. Acquire locks for all data read and compare the data timestamps. 5. If one of them is newer, the transaction is rolled back, otherwise it is committed. Advantages: - Better overall throughput as locks are held for only a short period of time - Timestamp comparison logic is implemented as a framework mechanism in the client session objects, simplifying the process Serializability with Two-Phase Locking Process: 1. Allocate all locks 2. Manipulate the data 3. Release all locks Advantages: Requires that all locks be allocated before any data manipulation and released only after the manipulation is complete. Guarantees serializability. Deadlocks - State where two or more processes are blocked because each one is waiting for resources held by the other - Results in a situation where the processes cannot continue to run and are stuck in a permanent waiting state - Can occur in concurrent systems where multiple processes access shared resources Distributed Deadlock Detection Local wait-for-graphs: - Correctness: Based on the definition of a wait-for-graph, this method correctly detects deadlocks by identifying cycles in the graph. - Liveness: This method can only detect deadlocks that exist within a single process or machine, so it may miss deadlocks in a distributed system. - Cost/complexity: The cost of implementing this method is relatively low, as it only requires tracking locks and resource requests within a single process. - Failure model: This method is susceptible to false negatives (missed deadlocks) in a distributed system. - Architecture type: This method is suitable for systems with a centralized architecture, where all locks and resource requests can be monitored by a single process. Detection servers: - Correctness: Detection servers are designed to detect deadlocks in a distributed system, so this method should provide correct results if implemented correctly. - Liveness: This method is designed to detect deadlocks in a distributed system, so it should have better liveness compared to local wait-for-graphs. - Cost/complexity: The cost of implementing this method is higher than local wait-for-graphs, as it requires communication and coordination between multiple processes. - Failure model: This method is susceptible to false negatives if one or more detection servers fail, or if there are errors in the communication between the servers.
- Architecture type: This method is suitable for systems with a decentralized architecture, where multiple processes are involved in the detection of deadlocks. Distributed edge chasing algorithms: - Correctness: This method is designed to detect deadlocks in a distributed system, so it should provide correct results if implemented correctly. - Liveness: This method is designed to detect deadlocks in a distributed system, so it should have better liveness compared to local wait-for-graphs. - Cost/complexity: The cost of implementing this method is higher than local wait-for-graphs, as it requires communication and coordination between multiple processes. - Failure model: This method is susceptible to false negatives if there are errors in the communication between the processes. - Architecture type: This method is suitable for systems with a decentralized architecture, where multiple processes are involved in the detection of deadlocks. Stochastic detection: - Correctness: The accuracy of this method depends on the parameters used, so it may provide incorrect results in some cases. - Liveness: This method is designed to detect deadlocks in a distributed system, so it should have better liveness compared to local wait-for-graphs. - Cost/complexity: The cost of implementing this method is relatively low, as it only requires monitoring resource requests and using randomization to make decisions. - Failure model: This method may miss deadlocks if the randomization parameters are not set correctly. - Architecture type: This method is suitable for systems with a decentralized architecture, where multiple processes are involved in the detection of deadlocks. Transactions Classic ACID Definitions - Durability: Ensures that once a transaction is committed, its effects persist even in the case of system failures (e.g. a crash must not lose the changes you already saved to a document) - Atomicity: Ensures that a transaction is treated as a single, indivisible unit of work that either happens in its entirety or doesn't happen at all (e.g. re-scheduling a birthday party must not end with only some of the participants informed) - Isolation: Ensures that the concurrent execution of transactions results in a system state that would be obtained if the transactions were executed serially (e.g. if two people work on a shared file, their changes should not interfere with each other) - Consistency: Ensures that the system remains in a valid state after a transaction is executed, i.e. all integrity constraints that held before the transaction still hold afterwards Transaction Properties and Mechanisms - Atomic Changes over Distributed Resources: This is achieved through the use of consensus or voting algorithms such as two-phase commit. - Consistency: This is maintained by observing consistency constraints between objects, such that the system remains in a valid state before and after a transaction is executed. - Isolation from Concurrent Access: This is accomplished through the use of locking mechanisms, such as two-phase locking or hierarchical locking. - Durability of Changes: This is ensured by transferring changes made to memory objects to persistent storage, to prevent loss in case of a system failure. Serializability and Isolation Definition: States that the outcome of executing a set of transactions should be equivalent to some serial execution of those transactions.
Purpose: The purpose of serializability is to ensure that each transaction operates on the database as if it were running by itself, which maintains the consistency and correctness of the database. Importance: Without serializability, ACID consistency is generally not guaranteed, making it a crucial component in maintaining the integrity of the database. Transaction API 1. System starts in a consistent state 2. Begin transaction 3. Modify objects Commit transaction: - System has a new, consistent state - Local objects are now invalid - Changes are visible to others On error: Rollback: - Either from the system or from the client - Only successfully committed operations become part of the new state that is durable and visible to others - Means going back to the beginning completely - Afterwards it is as if the client had never attempted the operation - Log files would have to be cleaned. Components of Distributed Transactions Process: - Begin() - Commit() - Rollback() RPCs: - Register (transactional servers) - Vote (objects) - Commit/rollback (objects, resource managers) - Read/write/prepare (resource managers) Components: - Transaction - Transactional client - Transactional servers (objects) - TA-Coordinator - XA resource managers Service Context - Some services require context information to flow with a call - Security: Needs to flow user information, access rights, etc. - Transactions: Needs to flow information about ongoing transactions to participants - The additional information needs to be standardized to allow different vendor implementations of services to interoperate. Distributed Two-Phase Commit Vote: - To achieve atomic operations in a distributed setting, the TA-Coordinator asks all participants for their vote on committing or rolling back. - Upon receiving a commit() call from a client, objects that are part of the TA vote by asking resource managers (e.g. databases) to prepare for the commit. - A successful return of "prepare" from resource managers means that both the object and the resource manager have promised to commit the changes if the coordinator sends a commit. Completion: - The coordinator is the only entity that can commit or abort a TA after the prepare phase. - If the vote phase was successful and all participants have prepared for a commit, the coordinator calls for a commit. - In case of an error (e.g. unreachable participant), the coordinator calls for a rollback. Failure Models of Distributed Transactions Work Phase: - If a participant crashes or becomes unavailable, the coordinator calls for a rollback. - If the client crashes before calling commit, the coordinator will time out the TA and call for a rollback. Voting Phase: - If a resource becomes unavailable or has other issues, the coordinator calls for a rollback. Commit Phase (Server Uncertainty): - In case of a crashed server, it will consult the coordinator after restart and ask for the decision (commit or rollback). Special Problems of Distributed Transactions Resources: - Participants in distributed TAs consume many system resources due to logging all actions to temporary persistent storage. - Large parts of the system may become locked during a TA. Coordinator as a Single Point of Failure: - The coordinator must also prepare for a crash and log all actions to temporary persistent storage. Heuristic Outcomes for Transactions: - In certain circumstances, the outcome of a transaction may only be determined heuristically if the real outcome cannot be determined. Transaction Types Flat Transactions: - Characterized by all-or-nothing behavior.
- Any failure causes complete rollback to original state. - Can result in loss of significant amount of work if many objects have been handled. Nested Transactions: - Allow partial rollbacks with a parent transaction. - Child TA rollback doesn’t affect parent TA, but parent TA rollback returns all participants to initial state. - Example: Allocation of a travel plan (hotel, flight, rental-car, trips, etc.). Long-running Transactions: - Challenge is resource allocation and increasing amount of work lost in case of rollback. - Syncpoints move the fallback position closer to the commit point. Compensating Transactions: - Improves transaction throughput by making objects visible sooner, at the cost of sacrificing the ISOLATION property. - Require compensation for previous TA which can no longer be rolled back. - Depend on the application whether compensating transactions are possible. - Can be hand-coded if no transaction monitor/manager is available. ANSI Transaction Levels Problems: - Dirty reads: Occurs when a transaction reads data written by another concurrent transaction that has not yet been committed. - Non-repeatable reads: Occurs when a transaction re-reads data it has previously read and finds that the data has been modified by another transaction that has since committed. - Phantom reads: Occurs when a transaction re-executes a query returning a set of rows that satisfies a search condition and finds that the set of rows satisfying the condition has changed due to another recently-committed transaction. Transaction Levels: - Read Uncommitted: - Prevents: Nothing - Read Committed: - Prevents: - Dirty reads. - Repeatable Read: - Prevents: - Dirty reads - Non-repeatable reads - Serializable: - Prevents: - Dirty reads - Non-repeatable reads - Phantom reads The higher the level, the more overhead is required. NoSQL Forces behind NoSQL - Need for low-latency and high-throughput access to data - Difficulty in managing and maintaining consistency in a distributed system - Increased focus on scalability and flexibility - Changing data requirements and needs for real-time processing - Cost and complexity of traditional RDBMs in large-scale systems - Inability of RDBMs to handle large amounts of unstructured data - The need for horizontal scaling in storage - Lack of support for real-time, complex data processing using RDBMs - The need for automatic scaling in storage to keep up with rapidly growing data - Relaxed data consistency requirements in some applications. Scaling Problems of RDBMs - Poor time complexity of SQL joins: O(m+n) or worse - Difficulty in horizontally scaling, resulting in loss of joins or jumping between nodes - Unbounded nature of queries, which can lead to a single query overloading a database - Optimized for storage efficiency (no duplicates), integrity, and flexibility of access through arbitrary joins. NoSQL Design Patterns - Use partition keys with many distinct values for better scalability and data distribution. - Opt for a single table design with hierarchical modeling and de-normalization to simplify the data structure. - Ensure that values are evenly requested to avoid hot spots. - Utilize composite secondary keys for 1:n and n:n queries. - Limit query responses with paging token for better performance. - Consider the use case and access patterns before finalizing the data layout. - Avoid relational modeling and instead focus on simplifying the data structure. - Data integrity is an application concern and should be handled by the application logic. 
- Data storage efficiency is not a primary concern. DynamoDB Design Principles - Decentralized design with no single point of failure (no master node) - Supports heterogeneous hardware - Symmetric peers for better scalability - Incrementally scalable to handle increasing load - Eventually consistent data replication - Requires a trusted environment for data security - Replication support for higher data availability. Always-write enabled with conflict resolution during read - Multi-version store with conflict resolution policies for better data management Beyond Relaxed Consistency Overview - Order-insensitive processing using CALM (Consistency as Logical Monotonicity) principles in EC (Eventual Consistency) programs - Converging replicated data types (CRDTs) divided into two types: - State-based CRDTs - Operation-based CRDTs CALM Principle - “Consistency as Logical Monotonicity” - Links consistency with logical monotonicity, where monotonic programs ensure eventual consistency regardless of the order of delivery and computation. - Monotonic programs do not require coordination, unlike non-monotonic programs where adding an element to the input set can revoke a previously valid output. - Non-monotonic programs require coordination schemes that wait until inputs are complete before proceeding. CALM Operations Logically Monotonic: - Initializing variables - Accumulating set members - Testing a threshold condition Non-monotonic: - Overwriting variables - Set deletion - Resetting counter - Negation CRDTs State-based CRDTs: - Calculate the new result at one node and then propagate it to replicas. - The data structure must be commutative, associative, and idempotent, e.g., sets. Operation-based CRDTs: - Send the requested operation to each replica and calculate the results locally. - The operations must be commutative with “exactly once” semantics (idempotent) and in FIFO order. - These delivery guarantees are difficult to achieve, making state-based CRDTs more popular currently. Bending the Problem - Separates data store and application-level consistency concerns. - CALM, ACID 2.0, and CRDT appeal to higher-level consistency criteria in the form of application-level invariants. - Instead of requiring strong consistency for every read and write, the application only needs to ensure semantic guarantees (e.g., “the counter is strictly increasing”). - This grants more flexibility in how reads and writes are processed. 
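To make the state-based idea concrete, here is a minimal sketch of a grow-only counter, the first CRDT in the example list that follows. The merge is an element-wise maximum, which is commutative, associative, and idempotent, so replicas converge regardless of the order in which states are exchanged; the class and method names are illustrative.

```python
class GCounter:
    """State-based grow-only counter: one non-decreasing slot per replica."""
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {replica_id: 0}

    def increment(self, amount: int = 1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter"):
        # Element-wise max: merging the same state twice, or in any order, changes nothing.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
print(a.value(), b.value())  # 3 3 -> both replicas converge
```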
Examples of CRDTs Counters: - Grow-only counter: Merge operation is max(values), payload is a single integer - Positive-negative counter: Consists of two grow counters, one for increments and another for decrements Registers: - Last Write Wins register: Uses timestamps or version numbers, merge operation is max(ts), payload is a blob - Multi-valued register: Uses vector clocks, merge operation takes both values Sets: - Grow-only set: Merge operation is union(items), payload is a set, no removal is allowed - Two-phase set: Consists of two sets, one for adding and another for removing, elements can be added once and removed once - Unique set: Optimized version of the two-phase set - Last write wins set: Merge operation is max(ts), payload is a set - Positive-negative set: Consists of one PN-counter per set item - Observed-remove set Distributed Coordination Features: - Configuration changes and notifications - Updates for failed machines - Dynamic integration and deconfiguration of new machines - Elastic configuration with partial failures - API for watches, callbacks, automatic file removal, and triggers - Simple data model (directory tree model) - High performance and highly available in-memory cluster solution - No locks for updates, but total ordering of requests for all cluster replicas - All replicas answer reads - Wait-free implementation of the coordination service, with the client API performing locks, leader selection, etc. Liveness and Correctness: - Sequential Consistency: Updates from a client will be applied in the order they were sent. - Atomicity: Updates either succeed or fail. No partial results. - Single System Image: A client will see the same view of the service regardless of the server it connects to. - Reliability: Once an update has been applied, it will persist from that time forward until a client overwrites the update. - Timeliness: The client's view of the system is guaranteed to be up-to-date within a certain time bound. Distributed Coordination Services Zookeeper API - create: Creates a node at a specified location in the tree - delete: Deletes a node from the tree - exists: Tests if a node exists at a specified location in the tree - get data: Retrieves the data stored at a node - set data: Writes data to a node - get children: Retrieves a list of children of a node - sync: Waits for data changes to be propagated to all nodes in the cluster. Primary-Order Atomic Broadcast with Zab - Primary sends non-commutative, incremental state changes to backup units - Order of incremental changes maintained even in case of primary crash - Multiple outstanding requests possible - Identification scheme to prevent re-ordering of updates - Synchronization phase to ensure old updates are delivered before new ones are stored. Consistency Requirements for ABCast (Reliable Ordered Atomic Broadcast) - Validity: If a correct process broadcasts a message, all correct processes will eventually deliver it. - Uniform Agreement: If a process delivers a message, all correct processes will eventually deliver it. - Uniform Integrity: Every process delivers a message at most once, and only if it was previously broadcast by its sender. - Uniform Total Order: If processes p and q both deliver messages m and m', their order must be the same. Primary Order - Local primary order: If a primary broadcasts (v, z) before (v', z'), a process that delivers (v', z') must have delivered (v, z) before (v', z').
- Global primary order: If Pi broadcasts (v, z) and a later primary Pj > Pi broadcasts (v', z'), a process delivering both (v, z) and (v', z') must deliver (v, z) first. - Primary integrity: If Pe broadcasts (v, z) and some process delivers (v', z') broadcast by an earlier primary Pe' < Pe, then Pe must have delivered (v', z') before broadcasting (v, z). HA Transactions - Provide transactional guarantees without unavailability during system partitions or high network latency (every non-failing replica must respond) - Not CAP-consistent: Cannot provide linearizability, i.e. reading the most recent write from any replica - Not HAT-compliant (cannot be provided with high availability): Serializability, Snapshot Isolation, Repeatable Read Isolation - Possible with algorithms relying on multi-versioning and client-side caching: Read Committed Isolation, transactional atomicity, etc. - Causal consistency with phantom prevention and ANSI Repeatable Read need affinity with at least one server (sticky sessions) - Unable to prevent concurrent updates to shared data items, cannot provide recency guarantees for reads. Design of Distributed Systems Design Principles for Distributed Systems Overview - Consideration of Latency: Examination of buffering and round-trip times - Importance of Locality: Proper placement of heavily interacting components - Avoiding Duplication of Work: Utilizing resources effectively - Resource Pooling: Reusing resources in communication such as connections or thread pools - Parallelization: Design for concurrent operations and minimize serialization - Evaluating Consistency: Determining the appropriate level of consistency with caching and replication - Caching and Replication Strategies: Utilizing prediction and bandwidth to reduce latency - End-to-end Argument: Minimizing heavy guarantees at lower levels of the system. Know Your No. 1 Enemy: Latency! Sharing Resources and Data - Pooling resources can improve performance even in local systems - High-frequency requests can lead to memory allocation issues and poor performance - Caching is crucial for the effectiveness of distributed applications - Minimizing backend requests while maintaining sane application logic - Breaking down information into smaller fragments can reveal reusable parts Connection Pooling - Matching server and database CPU capabilities - Avoiding blocking and app threads holding onto connections - Careful monitoring of wait time in the pool - Checking I/O rates with new hardware - Understanding what constitutes a "connection" to storage - Monitoring core/thread ratio, etc. Horizontal Scaling/Parallelization - Horizontal scaling through parallel processing - Every request can be handled by any thread on any host - Avoid synchronization points in servlet engines or database connections. Caching and Replication - Caching components are responsible for maintaining data validity - Data source is responsible for keeping replicas consistent and up-to-date - Focus on reducing back-end requests for improved efficiency. End-to-End Argument - User/Developer: Compensation for behavior through the application - Application Layer: Use of special commands, such as "Select for Update" or "Begin Transaction" - Intermediate Layer: Compiler/languages utilizing technologies such as Software Transactional Memory and memory models - Base Layer: Considerations for CPU cache coherence, database isolation levels, and real-time streaming, etc.
Design Methodology - Back-of-the-envelope calculations - Decide on geographical distribution and replication strategy - Determine data segregation, including single or multi-tenancy models and partitioning - Divide business requirements into REST-like services - Define SLAs for services, including availability, latency, throughput, consistency, and durability - Define security context with IAAA (Identity, Authentication, Authorization, Audit) and perform risk analysis - Complete monitoring and logging setup - Plan for deployment, release changes, testing, and maintenance using fault-tolerant features. Uncomfortable Real-World Questions - How many application servers are needed to support the customer base? - What is the optimal ratio of users to web servers? - What is the maximum number of users per server? - What is the maximum number of transactions per server? - Which specific hardware configurations provide the best performance? - What is the current production server capability? - What do the users do? (These are business process definitions.) - How fast do the users do it? What are the transaction rates of each business process? - When do they do it? What time of day are most users using it? - What major geographic locations are they doing it from? - How many connections can the server handle? - How many open file descriptors or handles is the server configured to handle? - How many processes or threads is the server configured to handle? - Does it release and renew threads and connections correctly? - How large is the server's listen queue? - What is the server's "page push" capacity? - What type of caching is done? Best Practices for Designing Services - Keep services independent - Measure services - Define SLAs and QoS for services - Allow agile development of services - Allow hundreds of services, aggregate them on special servers - Avoid middleware and frameworks that force patterns - Keep teams small and organized around services - Manage dependencies carefully - Create APIs for customer access to services Architecture Fields Overview - Information Architecture - Distribution Architecture - System Architecture - Physical Architecture - Architectural Validation Information Architecture (to analyze Caching) Defines pieces of information to aggregate or integrate:

| Data | Changed by time | Personalization |
| --- | --- | --- |
| Country Codes | No (not often, reference data) | No |
| News | Yes (aging only) | No, but personal selections |
| Greeting | No | Yes |
| Message | Yes (slowly aging) | Yes |

Distribution Architecture Tells the portal how to map/locate the fragments defined in the information architecture:

| Data Type | Source | Protocol | Port | Avg. Resp. | Worst Resp. | Downtimes | Max Conn. | Loadbal. | Security | Contact/SLA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| News | hostX | http/xml | 3000 | 100ms | 6 sec. | 17.00-17.20 | 100 | client | plain | Mrs.X/News-SLA |
| Research | hostY | RMI | 80 | 50ms | 500ms | 0.00-1.00 | 50 | server | SSL | Mr.Y/res-SLA |
Additional factors to consider: - Available bandwidth - Number of planned requests - Distance to the device - Availability numbers Example results: - Back-end server performance affecting home page construction time - Huge latencies and variations in response times causing instabilities in the portal application - Getting to the servers for every request is nearly impossible due to huge latencies and variations in response times from dependencies. Service Access Layer Is determined by the distribution architecture. - Handle changes in the interface - Monitors backend system connections - Disable connections that are not functioning properly ("fail fast") - Add new sources to the system - Poll and re-enable sources that have been temporarily disabled - Keep track of statistics on all sources. Simple Alternative: Sidecar, contains circuit breaker & service discovery Advanced Alternative: Service mesh with separate data and control plane Physical and Process Architecture Physical Architecture: - Deals with reliability issues (replication, high-availability, etc.) and scalability (horizontal and/or vertical) - Need to define scalability methods from the beginning due to their impact on overall system architecture Horizontally scalable application: - Replicated on multiple hosts - Avoids single point of failure Vertically scalable application: - Can only install more CPUs or RAM on a single host instance - Limited scalability and availability (HA application) Architecture Validation In the architecture validation phase these questions are answered: How does the architecture … - Handle security and privacy? - Handle data consistency and durability? - Handle disaster recovery and business continuity? - Handle performance, scalability and capacity? - Handle integration with other systems and data sources? - Handle upgrades, maintenance and support? - Align with the organization's goals, strategies and plans? Fan-Out Architecture Overview Calls are parallel instead of serial. - The overall request time is determined by the slowest sub-request - Each delay in an individual call adds to the runtime - Long timeout settings negatively impact response times - Using short timeouts for back-end server calls is recommended - Running short requests in separate threads may not be productive, consider request bundling - An error from one sub-call should not block the whole request, have a fallback - Avoid all threads getting stuck on a dysfunctional sub-call (bulkhead) - Temporarily close dead connections (circuit-breaker) - (A code sketch of these guidelines follows after the Fragments subsection.) Reliability Issues in Dependencies - System load becomes worse due to hanging requests occupying resources and leading to heavy garbage collection - Dead servers can cause a buildup of threads even with short timeouts - The portal was frequently impacted by failing back-end servers - Avoid lengthy waiting time for sub-requests in the homepage action handler: adopt the "fail-fast" pattern today. Fragments Pages: Unique to customers, cannot be re-used Page fragments: - Can be shared and heavily re-used - Allows huge reduction in back-end requests - Downside: If fragments change, a mechanism is needed to invalidate dependent pages.
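Here is a small sketch of the fan-out guidelines from the overview above (parallel sub-requests, short per-call timeouts, and fallbacks instead of failing the whole page); the fragment names, timeout value, and fallback contents are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_news():     return "latest news"          # stand-ins for slow back-end calls
def fetch_greeting(): return "Hello, Felicitas"
def fetch_messages(): raise RuntimeError("backend down")

FALLBACKS = {"news": "cached news", "greeting": "Hello", "messages": ""}

def build_page(timeout_s: float = 0.2) -> dict:
    calls = {"news": fetch_news, "greeting": fetch_greeting, "messages": fetch_messages}
    page = {}
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:  # fan out in parallel
        futures = {name: pool.submit(fn) for name, fn in calls.items()}
        for name, future in futures.items():
            try:
                # Short timeout: one slow dependency must not dominate the response time.
                page[name] = future.result(timeout=timeout_s)
            except Exception:
                # Fail silent / serve stale data instead of failing the whole request.
                page[name] = FALLBACKS[name]
    return page

print(build_page())  # the broken "messages" fragment degrades gracefully
```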
Latency Reduction & Tolerance - Keep response times tight but be aware of stragglers - Fight stragglers with backup requests and cross-server cancellation - Watch for overload at the sender when responses come back - Do NOT spread background load evenly over time, synchronize background load across machines instead - Reduce head-of-line blocking (partition large requests) - Partition data across machines - Cheat by coming back with partial data - Cross-request adaptation - Increase the replication count - Beware of the incast problem Avoiding Getting Stuck - Fail Fast: Don't wait for problematic resources - Timeouts: Use timeouts when accessing a service - Exponentially Decreasing Retries: If retries are needed, back off exponentially (decrease the retry frequency) - Fallback: Use alternatives when a service doesn't work, such as serving stale data - Caching: Retrieve data from the cache if the real-time dependency is unavailable, even if the data is stale - Eventual Consistency: Queue writes to be persisted once the dependency is available - Stubbed Data: Revert to default values if personalized options can't be retrieved - Empty Response (Fail Silent): Return null or an empty list that UIs can ignore. Containing Failures Circuit Breakers - Purpose: Handle faults that might take a long time to recover from - Provide a control mechanism to prevent the application from continually trying to perform a failing operation - Allows the application to fail fast and respond to failures quickly - Acts as a switch that "trips" when the system detects a failure - Stops the application from making further attempts to perform the operation until reset - Helps prevent the application from becoming unresponsive - Protects other parts of the system from being affected by the failure (a minimal code sketch follows at the end of this section). Bulkheads - The bulkhead pattern is a design for fault-tolerant applications - Elements of an application are isolated into pools - If one pool fails, others will continue to function - Named after the sectioned partitions (bulkheads) of a ship's hull - Example: semaphores and thread pools Blast Reduction - Partition the app into geographical regions (e.g. US, DACH etc.) - Split regions further into specific availability zones and further into cells - Shuffle sharding: Provide single-tenant-like isolation for shared workloads - Split the app itself into separate control and data planes
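To close, a minimal circuit-breaker sketch along the lines described under "Containing Failures"; the thresholds, timing, and half-open probe are simplified, illustrative choices rather than the behavior of any particular library.

```python
import time

class CircuitBreaker:
    """Trips to OPEN after repeated failures; lets a trial call through after a cool-down."""
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")  # do not even try the call
            self.state = "HALF_OPEN"                              # cool-down over: allow one probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"                               # trip the breaker
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"                                     # success resets the breaker
        return result

breaker = CircuitBreaker()
# breaker.call(some_risky_backend_call)  # wrap any dependency call you want to protect
```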