Uni Distributed Systems Condensed Summary

Condensed summary for the distributed systems course at HdM Stuttgart

Felicitas Pojtinger

2023-02-05

1 Meta

1.1 Contributing

These study materials are heavily based on professor Kriha’s “Verteilte Systeme” lecture at HdM Stuttgart and prior work of fellow students.

Found an error or have a suggestion? Please open an issue on GitHub (github.com/pojntfx/uni-distributedsystems-notes):

QR code to source repository

If you like the study materials, a GitHub star is always appreciated :)

1.2 License

AGPL-3.0 license badge

Uni Distributed Systems Notes (c) 2023 Felicitas Pojtinger and contributors

SPDX-License-Identifier: AGPL-3.0

2 Introduction to Distributed Systems

2.1 Laws and Terms

2.1.1 Metcalfe’s Law

Metcalfe’s law is a principle that states that the value or utility of a network increases as the number of users in the network increases. This is because the more people who are using the network, the more useful it becomes as a platform for communication, collaboration, and the exchange of information and resources. The adoption rate of a network also tends to increase in proportion to the utility provided by the network, which is why companies often give away software or other products for free in order to increase the size of their user base and the value of their network.
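Quantitatively, the law is usually stated in terms of the number of possible pairwise connections between n users, which grows roughly with the square of the user count:

value \propto \frac{n(n-1)}{2} \sim n^{2}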

Metcalfe’s law is often cited as a factor that can contribute to the emergence of scale-free, or power law, distributions in networks. This type of distribution is characterized by a few nodes (or users) with many connections, and many nodes with only a few connections. The existence of network effects, in which the value of a network increases with the number of users, can help to explain why we don’t see many Facebooks or Googles – it can be difficult for new networks to gain traction and achieve the same level of utility as established networks with a large user base.

2.1.2 Generalized Queuing Theory Terms (Henry Liu)

2.1.3 Little’s Law

2.1.4 Heijunka

Heijunka is a Japanese term that refers to the practice of leveling the production process by smoothing out fluctuations in demand and task sizes. It is often used in lean manufacturing and just-in-time (JIT) production systems to improve the efficiency and flow of work through a system.

The goal of Heijunka is to create a steady, predictable flow of work through the system by reducing variability in task sizes and demand. This can be achieved through a variety of methods, such as:

By leveling the production process and reducing variability in task sizes, Heijunka can help to improve the efficiency and flow of work through a system, and reduce the risk of bottlenecks or delays caused by large differences in task size.

2.1.5 Amdahl’s Law

According to Amdahl’s Law, the maximum improvement in overall system performance that can be achieved by improving a particular part of the system is limited by the fraction of time that the improved part of the system is used. In other words, if only a small portion of the system’s workload is affected by the improvement, the overall improvement in performance will also be small.

speedup = \frac{1}{(1 - p) + \frac{p}{n}}

where p is the parallelizable fraction of the workload and n is the number of processors.

For example, if a particular part of a system is improved so that it runs twice as fast, but that part of the system is only used 10% of the time, the overall performance gain is only about 5%, and even an infinitely fast version of that part could not improve overall performance by more than roughly 11%. If the improved part is used 50% of the time, doubling its speed yields a much larger overall improvement of about 33%, with a theoretical maximum of 2x.
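Plugging these two cases into the formula with a speedup factor of 2 for the improved part (i.e. n = 2):

speedup_{p=0.1} = \frac{1}{(1-0.1)+\frac{0.1}{2}} \approx 1.05

speedup_{p=0.5} = \frac{1}{(1-0.5)+\frac{0.5}{2}} \approx 1.33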

Amdahl’s Law is often used to understand the potential benefits and limitations of optimizing or improving specific parts of a system. It can be a useful tool for determining how much resources should be invested in improving a particular part of the system, and for understanding the potential impact of those improvements on overall system performance.

2.2 Process and I/O Models

2.2.1 Different Process Models

2.2.2 Questions for Process Models

2.2.3 Different I/O Models

2.2.4 Questions for I/O Models

3 Message Protocols

3.1 Delivery Guarantees

3.1.1 The Role of Delivery Guarantees

Shop order: The scenario described involves an online shop in which orders are placed and processed. The goal is to ensure that orders are delivered correctly and efficiently, regardless of any potential issues that may arise.

3.1.2 Why is TCP not Enough?

While TCP (Transmission Control Protocol) is a widely used networking protocol that provides a reliable communication channel between devices, it is not always sufficient on its own to ensure proper behavior in all situations. Here are some reasons why TCP may not be enough:

To address these and other issues, it may be necessary to use additional protocols or techniques, such as message-oriented middleware or application-level protocols, to ensure proper behavior in case of connection failures and to provide additional features and guarantees for message delivery.

3.1.3 Different Levels of Timeouts

3.1.4 Delivery Guarantees for RPCs

3.2 Idempotency

3.2.1 Overview

Idempotency is a property of operations or requests that ensures that they can be safely repeated without changing the result of the operation. In other words, if an operation is idempotent, it will have the same result whether it is performed once or multiple times.
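As a minimal illustration (the account balance and function names are hypothetical), overwriting a value is idempotent while adding to it is not, which is why retries are only safe for the former:

```python
balance = {"alice": 100}

def set_balance(account: str, amount: int) -> None:
    """Idempotent: applying this once or many times yields the same state."""
    balance[account] = amount

def add_to_balance(account: str, amount: int) -> None:
    """Not idempotent: every duplicated delivery changes the result again."""
    balance[account] += amount

set_balance("alice", 150)
set_balance("alice", 150)    # safe retry: balance stays 150
add_to_balance("alice", 50)
add_to_balance("alice", 50)  # duplicated request: balance is 250 instead of 200
```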

3.2.2 Server State and Idempotency

Idempotency is an important property to consider when designing operations or requests that may be repeated or delivered multiple times, as it can help to ensure that the operation or request is able to be safely repeated without affecting the overall result. Here are some additional considerations related to idempotency and server state:

3.2.3 Implementing Delivery Guarantees for Idempotent Requests

3.2.4 Repeating Non-Idempotent Operations

If an operation is not idempotent, it means that it cannot be safely repeated multiple times and is likely to have unintended side effects. In this case, there are several measures that can be taken to ensure reliable communication:

It is important to carefully manage state on the server in order to avoid overloading the system with old replies. It may be necessary to set limits on how long to store old replies and when to discard them.
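A minimal sketch of one such measure, storing old replies keyed by a client-supplied request ID so that duplicates are answered from the store instead of being re-executed (names are hypothetical; a real implementation would also expire old entries as described above):

```python
class Server:
    def __init__(self):
        self.counter = 0
        self.replies = {}  # request id -> stored reply, so duplicates are not re-executed

    def handle(self, request_id: str, amount: int) -> int:
        # Duplicate delivery: replay the stored reply instead of repeating the side effect.
        if request_id in self.replies:
            return self.replies[request_id]
        self.counter += amount            # non-idempotent operation runs at most once
        self.replies[request_id] = self.counter
        return self.counter

server = Server()
server.handle("req-1", 50)  # -> 50
server.handle("req-1", 50)  # retry after a lost reply -> still 50, not 100
```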

3.3 Order

3.3.1 Request Order in Multi-Point-Protocols

3.3.2 Request Ordering with Multiple Nodes

In a multi-node system, it may be necessary to use a reliable broadcast protocol to ensure that requests are processed in the desired order. Here are some examples of protocols that can be used for request ordering with multiple nodes:

3.3.3 Implementing Causal Ordered Broadcasts

3.3.4 Implementing Absolutely Ordered Casts

There are however disadvantages with these implementations:

3.3.5 Reliable Messaging

Reliable B2B (Business-to-Business) messages require the following qualities:

SOAP and HTTP partially achieve this like so:

  1. The first application layer exchanges persistent messages with the requester.
  2. The requester sends a SOAP message with a message ID, sequence number, and QoS (Quality of Service) tag to the receiver.
  3. The receiver must send an acknowledgement.
  4. The receiver exchanges persistent messages with the second application layer.

3.3.6 Transaction Models

Atomic transactions:

Activity transactions:

4 Theoretical Foundations of Distributed Systems

4.1 Foundational Concepts

4.1.1 The Eight Fallacies of Distributed Computing

4.1.2 Analyzing Latency

4.1.3 Characteristics of Distributed Systems

4.2 Consistency

4.2.1 Liveness vs. Correctness

Correctness and liveness are two important properties of distributed systems that ensure they function as intended and make progress.

Both correctness and liveness are based on assumptions about failures and other conditions in the system, such as fairness and the presence of Byzantine errors. Ensuring that a distributed system has both correctness and liveness is critical for its success.

4.2.2 Liveness and Correctness in Practice

Here is an example of how correctness and liveness can be defined for an event-based system:

Correctness:

Liveness:

Failure Assumptions: Fail-stop model with fairness

In this example, the system is designed to ensure correctness by limiting the notifications a node receives to those it is subscribed to and ensuring that notifications are received only once. It is designed to ensure liveness by ensuring that a node will eventually start receiving notifications after subscribing. The system also makes assumptions about failures, such as the fail-stop model with fairness, which are used to ensure the correctness and liveness of the system.

4.2.3 Timing Models

In distributed systems, timing models refer to the way that events and communication between nodes are synchronized. There are three main types of timing models: synchronous, asynchronous, and partially synchronous.

4.2.4 The Fischer, Lynch and Paterson Result

4.3 The CAP Theorem

4.3.1 Overview

States that in the presence of a network partition, a distributed system must choose either consistency or availability, but not both.

4.3.2 Preconditions for the CAP Theorem

4.3.3 Common Misconceptions of the CAP Theorem

4.3.4 The Modern View of the CAP Theorem

4.3.5 PACELC

PACELC: A more complete portrayal of the space of potential consistency tradeoffs for distributed database systems.

4.4 Failure

4.4.1 Technical Failures

Unfortunately, in most cases there is no failure detection service that can identify when these failures occur and allow others to take appropriate action. However, it is possible to develop a failure detection service that could detect even partitioning and other types of failures through the use of MIBs (Management Information Bases) and triangulation. Applications could also be designed to track themselves and restart if necessary.

4.4.2 Failure Types

4.4.3 Failure Models

Many protocols for achieving consistency and availability make assumptions about failure models. For example, transaction protocols may assume recovery behavior by their participants if the protocol terminates.

4.4.4 Failures and Timeouts

4.4.5 Failure Detectors

A failure detector (FD) is a mechanism used in distributed systems to detect failures of processes or machines. It is not required to be correct all the time, but it should provide the following quality of service (QoS):
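A minimal sketch of a heartbeat-based failure detector, assuming every process sends periodic heartbeats and is merely suspected (not known) to have failed once its heartbeat is overdue (the timeout value is hypothetical):

```python
import time

class HeartbeatFailureDetector:
    def __init__(self, timeout_seconds: float = 5.0):
        self.timeout = timeout_seconds
        self.last_seen = {}  # process id -> time of the last received heartbeat

    def heartbeat(self, process_id: str) -> None:
        self.last_seen[process_id] = time.monotonic()

    def suspects(self) -> list[str]:
        """Processes whose heartbeat is overdue; a suspicion, not proof of a crash."""
        now = time.monotonic()
        return [p for p, t in self.last_seen.items() if now - t > self.timeout]
```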

4.5 Clocks

4.5.1 Time in Distributed Systems

In distributed systems, there is no global time that is shared across all processes. Instead, different approaches are used to model time in these systems. These approaches include:

Logical time is modeled as partially ordered events within a process or between processes. It is used to represent the relative order of events in a distributed system, rather than the actual clock time at which they occurred.

Causal meta-data in the system can also order the events properly.
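A minimal sketch of a Lamport logical clock, one way to realize this partial ordering: every local event increments a counter, and receiving a message pulls the local counter ahead of the sender's timestamp:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self) -> int:
        self.time += 1
        return self.time

    def send(self) -> int:
        """Increment and attach the timestamp to an outgoing message."""
        self.time += 1
        return self.time

    def receive(self, message_time: int) -> int:
        """On receive, jump ahead of the sender's timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time
```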

4.5.2 Consistent vs. Inconsistent Cuts

4.5.3 Event Clocks (Logical Clocks)

4.5.4 Lamport Logical Clock

4.5.5 Vector Clocks

4.5.6 Physical Interval Time (TrueTime)

4.5.7 Hybrid Clocks

Hybrid clocks are systems that combine elements of both logical and physical models of time in order to address the limitations of each approach. There are several reasons why hybrid clocks may be used in distributed systems:

4.5.8 Ordering in Distributed Event Systems

4.6 Consensus

4.6.1 Overview

Consensus is a process used by a group of processes to reach agreement on a specific value, based on their individual inputs. The objective of consensus is for all processes to decide on a value v that is present in the input set.

4.6.2 Algorithms for Consensus

These protocols offer trade-offs in terms of correctness, liveness (availability and progress), and performance:

4.6.3 Two-Phase Commit (2PC)

Steps:

  1. Preparation phase: In this phase, the processes prepare to commit to a decision. Each process sends a “prepare to commit” message to a coordinator process, which is responsible for coordinating the decision-making process.
  2. Decision phase: Once all the processes have prepared to commit, the coordinator process sends a “commit” message to all the processes. This message instructs the processes to commit to the decision.
  3. Confirmation phase: Each process sends a “commit confirmation” message to the coordinator process, indicating that it has successfully committed to the decision.
  4. Finalization phase: Once all the processes have confirmed that they have committed to the decision, the coordinator process sends a “commit complete” message to all the processes, indicating that the decision has been successfully made.

Example:

  1. Imagine that there are three processes in a distributed system: A, B, and C.
  2. The coordinator process is A.
  3. The processes are deciding whether to commit to a new software update.
  4. In the preparation phase, A sends a “prepare to commit” message to B and C.
  5. B and C send “prepare to commit” messages back to A.
  6. In the decision phase, A sends a “commit” message to B and C.
  7. In the confirmation phase, B and C send “commit confirmation” messages back to A.
  8. In the finalization phase, A sends a “commit complete” message to B and C, indicating that the decision to commit to the software update has been successfully made.
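The coordinator logic of the steps above can be sketched as follows (in-memory participants, no persistence or timeouts; all names are hypothetical):

```python
class Participant:
    def __init__(self, will_commit: bool = True):
        self.will_commit = will_commit

    def prepare(self) -> bool:
        return self.will_commit      # vote: "prepared to commit" or not

    def commit(self) -> None:
        print("committed")

    def abort(self) -> None:
        print("aborted")

def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1 (voting): ask every participant to prepare and collect the votes.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2 (completion): unanimous yes, tell everyone to commit.
        for p in participants:
            p.commit()
        return True
    # At least one "no" vote (or a missing vote in a real system): abort everywhere.
    for p in participants:
        p.abort()
    return False

two_phase_commit([Participant(), Participant(), Participant()])  # -> True
```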

Liveness and Correctness:

4.6.4 Quorum-Based Consensus

Steps:

  1. A decision is proposed by one of the processes in the distributed system.
  2. Each process in the system votes on the proposed decision.
  3. The votes are counted and checked against the quorum requirement. The quorum requirement is the minimum number of votes that must be received in favor of the decision in order for it to be approved.
  4. If the quorum requirement has been met, the decision is considered to have been approved.
  5. The processes move forward with implementing the decision.

Example:

  1. A group of five processes in a distributed system (A, B, C, D, and E) are deciding whether to commit to a new software update.
  2. Process A proposes the decision to update the software.
  3. Processes B, C, D, and E all vote on the proposed decision.
  4. The votes are counted and checked against the quorum requirement, which is set at three votes.
  5. Since a quorum of three votes has been received in favor of the decision, it is considered to have been approved.
  6. The processes move forward with implementing the software update.
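The vote-counting step in this example reduces to a simple quorum check (three out of five votes):

```python
def approved(votes: list[bool], quorum: int = 3) -> bool:
    """A proposal is approved once at least `quorum` votes are in favor."""
    return sum(votes) >= quorum

approved([True, True, True, False, False])  # -> True: the quorum of 3 is reached
```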

Liveness and Correctness:

4.6.5 Raft

Raft is a distributed consensus protocol that allows a group of processes (called “replicas”) to agree on a value (“decide”) in the presence of failures. Each replica takes on one of three distinct roles: Leader, Follower, or Candidate.

The protocol consists of the following steps:

  1. Leader Election:
  2. Log Replication:
  3. State Machine Update:

4.7 Broadcasting

4.7.1 Atomic Broadcast Conditions

A distributed algorithm that guarantees correct transmission of a message from a primary process to all other processes in a network or broadcast domain, including the primary.

It satisfies the following conditions:

It is widely used in distributed computing for group communication and defined as a reliable broadcast that satisfies total order.

4.7.2 Atomic Broadcast Protocol

Data:

Phases:

  1. Leader election/discovery: Members decide on a new leader and form a consistent view of the group.
  2. Synchronization/recovery: Leader gathers outstanding, uncommitted requests recorded at members and updates members missing certain data until all share the same state.
  3. Working: Leader proposes new transactions to the group, collects confirmations, and sends out commits.

4.7.3 Gossip Protocols

Gossip protocols are a class of distributed algorithms that rely on randomly chosen pairs of nodes in a network to exchange information about the state of the system. They are typically used for group membership, failure detection, and dissemination of information.
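A minimal sketch of one gossip round, assuming each node periodically picks a random peer and both sides keep the newer version of every key (the versioning scheme is a simplification):

```python
import random

class GossipNode:
    def __init__(self, name: str):
        self.name = name
        self.state = {}  # key -> (version, value)

    def update(self, key, value):
        version, _ = self.state.get(key, (0, None))
        self.state[key] = (version + 1, value)

    def gossip_once(self, peers: list["GossipNode"]) -> None:
        peer = random.choice(peers)
        # Exchange states in both directions and keep the newer version per key.
        for node, other in ((self, peer), (peer, self)):
            for key, (version, value) in other.state.items():
                if key not in node.state or node.state[key][0] < version:
                    node.state[key] = (version, value)
```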

There are several key characteristics of gossip protocols:

4.7.4 DWAL

A DWAL (Distributed Write-Ahead-Log) is a data structure that is used to ensure that updates to a distributed system are stored in a way that allows them to be recovered in case of system failure. It is a type of write-ahead log, which means that updates are written to the log before they are applied to the system’s state. This allows the updates to be replayed in the correct order after a system failure.
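A minimal single-node sketch of the write-ahead idea (the distributed variant additionally replicates the log): an update is appended and flushed to the log before it is applied to the state, so the state can be rebuilt by replaying the log after a crash (the file name is hypothetical):

```python
import json, os

LOG_PATH = "updates.wal"  # hypothetical log file

def append_and_apply(state: dict, key: str, value: str) -> None:
    entry = json.dumps({"key": key, "value": value})
    with open(LOG_PATH, "a") as log:
        log.write(entry + "\n")
        log.flush()
        os.fsync(log.fileno())   # durable before the state is touched
    state[key] = value           # apply only after the log write succeeded

def recover() -> dict:
    """Rebuild the state after a failure by replaying the log in order."""
    state = {}
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as log:
            for line in log:
                entry = json.loads(line)
                state[entry["key"]] = entry["value"]
    return state
```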

4.7.5 Design Components of DWALs

4.8 Replication

4.8.1 Properties of Replication Models

4.8.2 Single-Leader Replication

Steps:

  1. Client sends x=5 to Node1 (master)
  2. Master updates node 2 and node 3 (followers)
  3. Client receives changed value (or old value; due to replication lag)

Advantages:

Disadvantages:

4.8.3 Eventually Consistent Reads

Eventual consistency model: Allows for a certain lag before updates are propagated to all replicas

Steps:

  1. Client updates value on Master-Replica node
  2. Master-Replica eventually propagates update to Slave replica
  3. Client performs a stale read from client node, potentially returning outdated value

4.8.4 Multi-Master Replication

Multi-Master Replication (MMR) is a type of replication in which multiple servers can accept write requests, allowing any server to act as a master. This means that updates can be made to any server, and the changes will be replicated to all other servers in the network. MMR can be used to improve the availability and scalability of a system, as it allows updates to be made to any server and allows multiple servers to handle write requests.

It also introduces the possibility of conflicts, as multiple servers may receive updates to the same data simultaneously. To resolve these conflicts, MMR systems typically use conflict resolution algorithms (last writer wins, keeping different versions, anti-entropy background merge/resolve) or allow the user to manually resolve conflicts.
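A minimal sketch of the simplest of these strategies, last-writer-wins, assuming every write carries a timestamp and ties are broken by node ID (both assumptions are hypothetical; note that the losing concurrent update is silently discarded):

```python
def last_writer_wins(a: tuple, b: tuple) -> tuple:
    """Each write is (timestamp, node_id, value); the newer write survives."""
    return max(a, b)  # tuple comparison: timestamp first, node_id as tie-breaker

write_on_node1 = (1675600000.0, "node1", 5)
write_on_node2 = (1675600001.5, "node2", 10)
last_writer_wins(write_on_node1, write_on_node2)  # -> node2's write (x=10) wins
```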

Steps:

  1. Client 1 writes x=5 to Node 1 (master)
  2. Client 2 writes x=10 to Node 2 (master)
  3. The masters detect a conflict

Conflict types:

4.8.5 Leaderless Quorum Replication

Write:

Read:

4.8.6 Session Modes of Asynchronous Replication

The following guarantees seem to enable “sequential consistency” for a specific client, meaning that the program order of updates from this client is respected by the system. Clients can track these guarantees using vector clocks:
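A minimal sketch of the vector-clock comparison a client can use for this, treating a clock as a map from node ID to counter: one clock happened before another if it is component-wise less than or equal and not identical, otherwise the two are concurrent:

```python
def happened_before(a: dict, b: dict) -> bool:
    """True if vector clock a is causally before b (b includes everything a saw)."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

def concurrent(a: dict, b: dict) -> bool:
    return not happened_before(a, b) and not happened_before(b, a)

happened_before({"node1": 2}, {"node1": 3, "node2": 1})  # -> True
concurrent({"node1": 2}, {"node2": 1})                   # -> True: conflicting updates
```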

We can also derive session anomalies from this:

4.8.7 Global Modes of Replication

There are multiple different modes to choose from:

Each of them have their own trade-offs:

5 Distributed Services and Algorithms I

5.1 Types of Distributed Services

5.1.1 What is a Distributed Service?

Function provided by a distributed middleware with:

5.1.2 Services and Instances

5.1.3 Core Distributed Services

5.2 Availability

Is defined as:

Availability = \frac{Uptime_{agreed\ upon} - Downtime_{planned\ and\ unplanned}}{Uptime_{agreed\ upon}}

Continuous availability does not allow for planned downtime.
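As a worked example, with an agreed-upon uptime of 24/7 (8760 hours per year), roughly 8.8 hours of total downtime per year correspond to about 99.9% (“three nines”) availability:

Availability = \frac{8760h - 8.76h}{8760h} \approx 99.9\%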

5.2.1 Typical Hardware Causes for Downtime

5.2.2 Availability through Redundancy

Across groups of resources:

Within a group of resources:

Between two resources:

For an individual resource:

5.2.3 3-Copy Disaster Recovery Solution

5.3 Load Balancing

5.3.1 Serial vs. Redundant Availability

5.3.2 Global Server Load Balancing

5.3.3 Failover with Virtual IPs

Failover with one virtual IP:

Multi-site failover:

5.3.4 Failover, Load Balancing and Session State

Today: Stateless servers with state in the DB are the norm, but sticky sessions are still used; their session records then need to be replicated to allow failover.

Compromise: Replicate sessions between pairs of servers, then enable switching between them as failovers

5.3.5 P2P Load Balancing

5.4 Caching

5.4.1 CQRS (Command Query Responsibility Segregation)

5.4.2 Requirements of Caching Services

5.4.3 Handling Changing Machine Counts in Caching Services

5.4.4 Consistent Hashing Algorithms

Simple Consistent Hashing Algorithm:

Dynamo Consistent Hashing Algorithm:
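A generic sketch of the simple variant above (assumptions: servers are placed on a hash ring via MD5, and a key is served by the first server clockwise from its hash, so adding or removing a machine only remaps the keys of its neighbors; no virtual nodes as in the Dynamo variant):

```python
import bisect, hashlib

class ConsistentHashRing:
    def __init__(self, nodes: list[str]):
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        """First node clockwise on the ring from the key's position."""
        h = self._hash(key)
        index = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[index][1]

ring = ConsistentHashRing(["cache1", "cache2", "cache3"])
ring.node_for("user:42")  # only keys near an added/removed node move to another server
```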

5.4.5 Cache Patterns

Pull:

Push:

Pre-warmed:

General consideration: Be aware of LRU or clocked invalidations, as the cache is mission-critical.

5.4.6 Cache Design Considerations

5.5 Events

5.5.1 Local vs. Distributed Events

Local:

Distributed:

5.5.2 Asynchronous Event-Processing

5.5.3 Features of Event-Driven Interaction

5.5.4 Types of Message Oriented Middleware (MOMs)

Centralized Message-Oriented-Middleware:

Clustered Message-Oriented-Middleware:

Simple P2P Event Libraries:

Flooding Protocols:

5.5.5 ZeroMQ

5.6 Sharding

5.6.1 Horizontal vs. Vertical Sharding

Horizontal: Per (database) row, e.g. the first 100 users are in shard 1, the next 100 in shard 2, etc.

Vertical: Per (database) column, e.g. profile and email are in shard 1, photos and messages in shard 2, etc.

5.6.2 Sharding Strategies

5.6.3 Horizontal Sharding Functions

Algorithms applied to the key (often: user ID) to create different groups (shards):
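Two common examples are range-based and hash-based sharding functions; a minimal sketch (shard size and shard count are hypothetical):

```python
import hashlib

def range_shard(user_id: int, shard_size: int = 100) -> int:
    """Range sharding: users 1-100 -> shard 0, users 101-200 -> shard 1, ..."""
    return (user_id - 1) // shard_size

def hash_shard(user_id: int, num_shards: int = 4) -> int:
    """Hash sharding: spreads keys evenly, but changing num_shards remaps most keys."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards
```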

5.6.4 Consequences of Sharding

5.7 Relationships

5.7.1 Overview

For example when an employee leaves:

5.7.2 Functional Requirements of Relationship Services

5.7.3 Why Relationship Services Failed

The good:

The bad:

6 Distributed Services and Algorithms II

6.1 Problems with Classic Concurrency

6.1.1 Why Truth is Expensive

6.1.2 Aspects of Classic Distributed Consistency

6.2 Locking

6.2.1 Locking Against Concurrent Access

Binary locks:

Modal locks (read/write locks):

Lock Granularity: The granularity of locks (the scope of the resources being protected by the lock) affects the overall throughput of a system. The smaller the lock granularity, the better the performance will be.

6.2.2 Optimistic Locking

Process:

  1. Lock a row, read it along with its timestamp, and then release the lock.
  2. Start a transaction
  3. Write the data to the database.
  4. Acquire locks for all data read and compare the data timestamps.
  5. If one of them is newer, the transaction operation is rolled back, otherwise it is committed.
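A minimal sketch of the timestamp/version comparison at commit time (a compare-and-swap on a single row; the row structure and names are hypothetical):

```python
class Row:
    def __init__(self, value, version=0):
        self.value = value
        self.version = version

def optimistic_update(row: Row, read_version: int, new_value) -> bool:
    """Commit only if nobody changed the row since we read it; otherwise roll back."""
    if row.version != read_version:
        return False           # conflict detected: caller must re-read and retry
    row.value = new_value
    row.version += 1
    return True

row = Row(value=42)
v = row.version                 # read the row together with its version
optimistic_update(row, v, 43)   # -> True: commit succeeds
optimistic_update(row, v, 44)   # -> False: stale version, operation rolled back
```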

Advantages:

6.2.3 Serializability with Two-Phase Locking

Process:

  1. Allocate all locks
  2. Manipulate the data
  3. Release all locks

Advantages: Because all locks are allocated before any data manipulation and released only after the manipulation is complete, serializability is guaranteed.

6.2.4 Deadlocks

6.2.5 Distributed Deadlock Detection

Local wait-for-graphs:

Detection servers:

Distributed edge chasing algorithms:

Stochastic detection:

6.3 Transactions

6.3.1 Classic ACID Definitions

6.3.2 Transaction Properties and Mechanisms

6.3.3 Serializability and Isolation

Definition: States that the outcome of executing a set of transactions should be equivalent to some serial execution of those transactions.

Purpose: The purpose of serializability is to ensure that each transaction operates on the database as if it were running by itself, which maintains the consistency and correctness of the database.

Importance: Without serializability, ACID consistency is generally not guaranteed, making it a crucial component in maintaining the integrity of the database.

6.3.4 Transaction API

  1. System starts in a consistent state
  2. Begins transaction
  3. Modifies objects

Commit transaction:

On error: Rollback:
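As an illustration of this API, a sketch using Python's built-in sqlite3 module (the accounts table and the transfer scenario are hypothetical):

```python
import sqlite3

def transfer(conn: sqlite3.Connection, source: str, target: str, amount: int) -> None:
    """Begin a transaction, modify objects, then commit; roll back on any error."""
    try:
        # sqlite3 implicitly begins a transaction before the first modifying statement
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE owner = ?", (amount, source))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE owner = ?", (amount, target))
        conn.commit()      # make both changes visible atomically
    except Exception:
        conn.rollback()    # on error: undo all changes of this transaction
        raise
```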

6.3.5 Components of Distributed Transactions

Process

RPCs:

Components:

6.3.6 Service Context

6.3.7 Distributed Two-Phase Commit

Vote:

Completion:

6.3.8 Failure Models of Distributed Transactions

Work Phase:

Voting Phase:

Commit Phase (Server Uncertainty):

6.3.9 Special Problems of Distributed Transactions

Resources:

Coordinator as a Single Point of Failure:

Heuristic Outcomes for Transactions:

6.3.10 Transaction Types

Flat Transactions:

Nested Transactions:

Long-running Transactions:

Compensating Transactions:

6.3.11 ANSI Transaction Levels

Problems:

Transaction Levels:

The higher the level, the more overhead is required.

6.4 NoSQL

6.4.1 Forces behind NoSQL

6.4.2 Scaling Problems of RDBMs

6.4.3 NoSQL Design Patterns

6.4.4 DynamoDB Design Principles

6.5 Beyond Relaxed Consistency

6.5.1 Overview

6.5.2 CALM Principle

6.5.3 CALM Operations

Logically Monotonic:

Non-monotonic:

6.5.4 CRDTs

State-based CRDTs:

Operation-based CRDTs:

6.5.5 Bending the Problem

6.5.6 Examples of CRDTs

Counters:
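For example, a grow-only counter (G-Counter) is a state-based counter CRDT: each replica increments only its own slot and merging takes the per-replica maximum, which makes merges commutative, associative, and idempotent (a minimal sketch):

```python
class GCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> number of increments seen from that replica

    def increment(self) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + 1

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Element-wise maximum: commutative, associative, idempotent."""
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)
```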

Registers:

Sets:

6.5.7 Distributed Coordination

Features:

Liveness and Correctness:

6.6 Distributed Coordination Services

6.6.1 Zookeeper API

6.6.2 Primary-Order Atomic Broadcast with Zab

6.6.3 Consistency Requirements for ABCast (Reliable Ordered Atomic Broadcast)

6.6.4 Primary Order

6.6.5 HA Transactions

7 Design of Distributed Systems

7.1 Design Principles for Distributed Systems

7.1.1 Overview

Know Your No. 1 Enemy: Latency!

7.1.2 Sharing Resources and Data

7.1.3 Connection Pooling

7.1.4 Horizontal Scaling/Parallelization

7.1.5 Caching and Replication

7.1.6 End-to-End Argument

7.1.7 Design Methodology

7.1.8 Uncomfortable Real-World Questions

7.1.9 Best Practices for Designing Services

7.2 Architecture Fields

7.2.1 Overview

7.2.2 Information Architecture (to analyze Caching)

Defines pieces of information to aggregate or integrate

| Data          | Changed by Time                | Personalization             |
| ------------- | ------------------------------ | --------------------------- |
| Country Codes | No (not often, reference data) | No                          |
| News          | Yes (aging only)               | No, but personal selections |
| Greeting      | No                             | Yes                         |
| Message       | Yes (slowly aging)             | Yes                         |

7.2.3 Distribution Architecture

Tells the portal how to map/locate the fragments defined in the information architecture.

| Data Type | Source | Protocol | Port | Avg. Resp. | Worst Resp. | Downtimes   | Max Conn. | Loadbal. | Security | Contact/SLA    |
| --------- | ------ | -------- | ---- | ---------- | ----------- | ----------- | --------- | -------- | -------- | -------------- |
| News      | hostX  | http/xml | 3000 | 100 ms     | 6 sec.      | 17.00-17.20 | 100       | client   | plain    | Mrs.X/News-SLA |
| Research  | hostY  | RMI      | 80   | 50 ms      | 500 ms      | 0.00-1.00   | 50        | server   | SSL      | Mr.Y/res-SLA   |

Additional factors to consider:

Example results:

7.2.4 Service Access Layer

Is determined by the distribution architecture.

Simple Alternative: Sidecar, contains circuit breaker & service discovery

Advanced Alternative: Service mesh with separate data and control plane

7.2.5 Physical and Process Architecture

Physical Architecture:

Horizontally scalable application:

Vertically scalable application:

7.2.6 Architecture Validation

In the architecture validation phase these questions are answered: How does the architecture …

7.3 Fan-Out Architecture

7.3.1 Overview

Calls are parallel instead of serial.

7.3.2 Reliability Issues in Dependencies

7.3.3 Fragments

Pages: Unique to customers, cannot be re-used

Page fragments:

7.3.4 Latency Reduction & Tolerance

7.3.5 Avoiding Getting Stuck

7.4 Containing Failures

7.4.1 Circuit Breakers

7.4.2 Bulkheads

7.4.3 Blast Reduction