If you like the study materials, a GitHub star is always appreciated :)
1.1.2 License
Uni Distributed Systems Notes (c) 2023 Felicitas Pojtinger and
contributors
SPDX-License-Identifier: AGPL-3.0
1.1.3 Course Timeline
This course on distributed systems covers a range of topics related
to the design, implementation, and management of distributed systems.
The course is divided into several sections, including:
Introduction to distributed systems: This section
provides an overview of distributed systems and their key
characteristics.
Theoretical models of distributed systems: This
section covers the use of theoretical models, such as queuing theory and
process and I/O models, to understand and analyze distributed
systems.
Message protocols: This section covers the use of
message protocols, including delivery guarantees, causality, and
reliable broadcast, to facilitate communication between components in a
distributed system.
Remote procedure calls: This section covers the use
of remote procedure calls (RPCs) to invoke functions on a remote
machine, as well as different RPC mechanisms such as marshaling, Thrift,
and gRPC.
Distributed objects: This section covers
distributed objects such as CORBA, RMI and DCOM.
Distributed business components: This section
covers general component technology as well as Enterprise Java Beans
(EJBs).
Services: This section covers services, a distribution
paradigm between hype and revolution.
Theoretical foundations of distributed systems:
This section covers key concepts and theories that are relevant to the
design of distributed systems, including the FLP theorem, time,
causality, consensus, eventual consistency, and optimistic
replication.
Distributed services and algorithms I: This section
covers the design and implementation of various distributed services and
algorithms, including load balancing, message queues, caching, and
consistent hashing.
Distributed services and algorithms II: This
section covers more advanced topics in distributed services, including
persistence, transactions, eventual consistency, and coordination.
Design of distributed systems: This section covers
the methodology and principles for designing distributed systems, as
well as examples of different architectures and design patterns.
1.2 Introduction to Distributed
Systems
1.2.1 Overview
Definition of a distributed system (DS)
Challenges for developers working with DS
Reasons for using a DS
Examples of DS
Characteristics of DS
Middleware for DS
Concepts and architectures in DS, including scale, parallelism, and
latency
Resources related to DS
1.2.2 Reasons for Distributing
Systems
1.2.2.1 Definition of a Distributed
System
A system that is made up of independent agents that interact with
each other and produce a cohesive behavior. The interactions and events
in this system happen concurrently and in parallel, meaning that they
occur simultaneously and independently of one another. This type of
system is often used to perform tasks that are too complex or large for
a single agent to handle, and the concurrent and parallel nature of the
interactions allows for efficient and effective processing of the
tasks.
1.2.2.2 Emergence
Four types of emergence: strong emergence, weak emergence,
evolutionary emergence, and constructed emergence.
Strong emergence refers to situations where it is
not possible to predict what will emerge from the interactions of the
system.
Weak emergence refers to situations where simple
principles combine to produce surprising results.
Evolutionary emergence refers to the complex but
robust development of a system over time, such as the transformation of
an egg into a human being.
Constructed emergence refers to the complex but
often not robust emergence of a system that has been intentionally
designed, such as a distributed system.
Emergent failure modes: Instances of cascading
failures in constructed emergence.
1.2.3 Why are Distributed Systems
complicated?
One reason is emergence, which refers to the
complex and often unexpected behaviors that can arise when multiple
components of a system interact with each other.
Another reason is the single machine view, which
can make it difficult to understand and debug issues that arise in a
distributed system.
Errors are also an inherent part of distributed
systems, and developers must be prepared to handle and troubleshoot
these errors.
Additionally, there is no “free lunch” in
distributed systems, meaning that there is always some trade-off or cost
associated with every design decision.
Finally, developing a distributed system involves total
end-to-end system engineering, which can be a complex and
time-consuming process.
All distributed systems algorithms are also based on the
failures that are expected and how they are handled, which can
add to the complexity of developing a distributed system.
1.2.3.1 Why Distribute a
System?
There are several reasons why an organization might choose to use a
distributed system:
Robustness/resilience: Distributed systems are
designed to be resistant to single points of failure, which makes them
more resilient and able to handle unexpected issues. This can be
achieved through techniques such as replication, which allows data to be
stored on multiple servers to ensure that it is still available even if
one server fails.
Performance: Distributed systems can be designed to
split processing into independent parts, which can improve overall
performance by allowing different parts of the system to operate in
parallel.
Scalability/throughput: Distributed systems can be
designed to handle large numbers of requests per second, making them
well-suited for applications that need to scale to handle high levels of
traffic.
Security: Distributed systems can be used to create
different security domains, which can help to protect sensitive data and
ensure that it is only accessed by authorized users.
Price per request: Distributed systems can be
designed to use cheaper horizontal scaling or free resources, which can
help to reduce the cost of processing each request.
1.2.3.2 Examples of Distributed
Systems
There are many examples of distributed systems, including:
Energy grids and telecom networks: These systems
distribute electricity and communication signals across large geographic
areas, often using complex networks of wires and other
infrastructure.
Villages, towns, and cities: These systems are
examples of distributed systems that are made up of many interconnected
parts, including homes, businesses, roads, and other
infrastructure.
IT infrastructure of large companies: Many large
companies rely on distributed systems to manage their IT infrastructure,
including servers, storage, and networks.
High-performance clusters: These systems are used
to perform complex calculations and simulations, often in scientific or
technical fields.
Google, Facebook, and other internet companies:
These companies rely on distributed systems to power their online
platforms and services, including search, social networking, and
advertising.
The web: The internet itself is a distributed
system, made up of millions of interconnected servers and devices that
share information and resources.
The human body, organizations, and states: These
systems are examples of distributed systems that are made up of many
interconnected parts, each of which plays a specific role in the overall
functioning of the system.
A flock of birds: A flock of birds is an example of
a distributed system in nature, with each bird communicating and
coordinating with the others to achieve a common goal.
1.2.4 Characteristics of
Distributed Systems
1.2.4.1 Overview
Influence of distribution topology and remoteness:
The physical layout of a distributed system and the distance between its
components can have a significant impact on its performance and
behavior.
Emergent behaviors and concurrent events:
Distributed systems can exhibit complex and often unexpected behaviors
as a result of the interactions between their components. Concurrent
events, in which multiple parts of the system are executing at the same
time, can also contribute to this complexity.
Few analytic solutions and few model-based
approaches: There are often few analytic solutions or
model-based approaches available for understanding and predicting the
behavior of distributed systems, which can make them challenging to
design and debug.
Heterogeneous components: Distributed systems often
consist of a wide variety of components, each with its own hardware,
software, and other characteristics. This can make it difficult to
ensure that all components are compatible and work together
effectively.
No global time: In a distributed system, there is
no single, global clock that all components can use to coordinate their
activities. This can make it difficult to synchronize the actions of
different parts of the system.
A strong need for security: Distributed systems
often handle sensitive data or perform critical functions, which makes
security a key concern.
Concurrency, parallelism, and replication:
Distributed systems often rely on concurrency, parallelism, and
replication to improve performance and resilience.
Failure models define everything: The design and
behavior of distributed systems are often shaped by the failure models
that are used to define how the system should respond to different types
of failures.
1.2.4.2 Eight Fallacies of
Distributed Computing
The network is reliable: This fallacy assumes that
the network is always available and that communication between
components will always be successful. In reality, networks can
experience outages or other problems that can disrupt
communication.
Latency is zero: This fallacy assumes that there is
no delay in communication between components, which is not always the
case. Latency, or the time it takes for a message to travel from one
component to another, can vary depending on the distance between
components and other factors.
Bandwidth is infinite: This fallacy assumes that
there is an unlimited amount of capacity available for transmitting
data, which is not always the case. Network bandwidth is a finite
resource that can become congested, leading to slower communication
speeds.
The network is secure: This fallacy assumes that
the network is invulnerable to security threats, such as hacking or data
breaches. In reality, networks can be vulnerable to these types of
threats, and it is important to implement appropriate security measures
to protect against them.
Topology doesn’t change: This fallacy assumes that
the physical layout of the network, or its topology, will not change
over time. In reality, the topology of a network can change due to
factors such as the addition or removal of components or changes in the
physical infrastructure.
There is one administrator: This fallacy assumes
that there is a single person or group responsible for administering the
network, which is not always the case. Distributed systems can have
multiple administrators, each with different responsibilities and
roles.
Transport cost is zero: This fallacy assumes that
there is no cost associated with transmitting data over the network,
which is not always the case. In reality, there are often costs
associated with networking, including hardware and software expenses and
maintenance costs.
The network is homogeneous: This fallacy assumes
that all components of the network are the same, which is not always the
case. In reality, distributed systems often consist of a variety of
components with different hardware, software, and other
characteristics.
1.2.4.3 Programming Languages and
Distributed Systems
There are two main approaches to programming languages and
distributed systems: the transparency camp and the message camp.
The transparency camp focuses on hiding the
complexity of a distributed system from the programmer. This can be
achieved through techniques such as creating type-safe interfaces and
calls, and hiding security, storage, and transactions behind frameworks
such as .NET or Enterprise Java Beans (EJBs). This approach treats the
distributed system as a programming model, rather than something that
requires special handling.
The message camp, on the other hand, takes a more
direct approach to programming distributed systems. This approach
typically involves using a simple create, read, update, delete (CRUD)
interface and using message content as the interface. Messages are often
coarse-grained, meaning that they carry a large amount of data in a
single message, often in the form of documents. Programmers in this camp
deal with the complexity of remoteness directly, and architectures are
often event-based or based on the representational state transfer (REST)
model.
1.2.4.4 History of Distributed
Systems
The history of distributed systems can be divided into several
distinct periods:
1950s-1980s: During this period, basic research was
conducted on topics such as time, consensus, and computability, which
laid the foundations for the development of distributed systems.
1990s: In the 1990s, distributed systems were used
to connect Intranet applications using technologies such as Common
Object Request Broker Architecture (CORBA), Remote Procedure Calls
(RPC), and Distributed Component Object Model (DCOM). Client-server web
servers also became popular during this period. Programming models
dominated the design and development of distributed systems.
2000s: In the 2000s, distributed systems were used
to power peer-to-peer software for file sharing and large social sites
emerged, which posed new challenges in terms of scalability and
performance. Message passing and parallel batch processing techniques
such as map/reduce were developed to address these challenges. This
period also saw the emergence of in-memory computing and the CAP
theorem, which established that it was not possible for a distributed
system to simultaneously provide all three of the following properties:
consistency, availability, and partition tolerance.
2010 and beyond: In the 2010s and beyond,
distributed systems have been used to power large-scale warehousing
systems, fan-out architectures, and real-time stream processing. Flash
memory and network performance have become key considerations, and
microservices and serverless computing have emerged as popular
approaches to designing and building distributed systems.
1.2.4.5 Metcalfe’s Law
Metcalfe’s law is a principle that states that the value or utility
of a network grows roughly with the square of the number of users in the network (n²).
This is because the more people who are using the network, the more
useful it becomes as a platform for communication, collaboration, and
the exchange of information and resources. The adoption rate of a
network also tends to increase in proportion to the utility provided by
the network, which is why companies often give away software or other
products for free in order to increase the size of their user base and
the value of their network.
Metcalfe’s law is often cited as a factor that can contribute to the
emergence of scale-free, or power law, distributions in networks. This
type of distribution is characterized by a few nodes (or users) with
many connections, and many nodes with only a few connections. The
existence of network effects, in which the value of a network increases
with the number of users, can help to explain why we don’t see many
Facebooks or Googles – it can be difficult for new networks to gain
traction and achieve the same level of utility as established networks
with a large user base.
1.2.4.6 Security Topics for
Distributed Systems
Security is an important concern in distributed systems, as they
often handle sensitive data or perform critical functions. Some key
security topics that are relevant to distributed systems include:
Authentication: This refers to the process of
verifying the identity of a user or device. In a distributed system,
authentication may be used to ensure that only authorized users can
access certain resources or perform certain actions.
Authorization: This refers to the process of
granting or denying access to specific resources or actions based on the
identity of a user or device. In a distributed system, authorization
controls may be used to ensure that users can only perform actions that
are appropriate for their role or level of access.
Confidentiality: This refers to the protection of
information from unauthorized access or disclosure. In a distributed
system, confidentiality may be achieved through techniques such as
encryption or the use of secure channels for communication.
Integrity: This refers to the protection of
information from unauthorized modification or tampering. In a
distributed system, integrity may be maintained through techniques such
as hashing or the use of digital signatures.
Non-repudiation: This refers to the ability to
prove that a specific action or transaction was performed by a
particular user or device. In a distributed system, non-repudiation may
be achieved through techniques such as digital signatures or the use of
timestamps.
Privacy/anonymity: These refer to the protection of
personal information and the ability to use a system without revealing
one’s identity. In a distributed system, privacy and anonymity may be
achieved through techniques such as the use of pseudonymous identities
or the use of encryption to protect communications.
Firewalls: A firewall is a security system that
controls incoming and outgoing network traffic based on predetermined
security rules. In a distributed system, firewalls can be used to
protect against unauthorized access and to prevent malicious traffic
from entering or leaving the system.
Certificates, public key infrastructure (PKI), and digital
signatures: These are tools and techniques that are used to
establish trust and authentication in a distributed system.
Certificates, for example, can be used to verify the identity of a user
or device, while PKI is a system that manages the issuance and
revocation of certificates. Digital signatures are used to verify the
authenticity of a message or document.
Encryption: Encryption is a technique that is used
to protect information from unauthorized access or disclosure. In a
distributed system, encryption can be used to secure communication
channels and to protect data at rest. There are a variety of encryption
methods and devices that can be used to achieve this goal.
Software architecture: The design and organization
of the software components in a distributed system can have a
significant impact on its security. It is important to consider security
throughout the software development process and to design the system
with security in mind.
Intrusion detection: This refers to the process of
identifying and responding to unauthorized access or activity in a
distributed system. Intrusion detection systems are used to monitor the
system for signs of an attack or other security breach, and to alert
administrators or take other appropriate action when an incident is
detected.
Sniffing: This refers to the practice of
intercepting and monitoring network traffic in order to gather
information or to perform other malicious actions. In a distributed
system, sniffing can be used to capture sensitive data or to disrupt
communication.
PGP, SSL, etc.: These are tools and protocols that
are used to secure communication in a distributed system. PGP (Pretty
Good Privacy) is a data encryption and decryption program, while SSL
(Secure Sockets Layer) is a protocol for establishing secure links
between networked computers.
Denial of service (DoS) attacks: These are attacks
that are designed to disrupt the availability of a network or system by
flooding it with traffic or other requests, thereby preventing
legitimate users from accessing the system. In a distributed system, DoS
attacks can be particularly disruptive as they can affect multiple
components at once.
1.2.4.7 Theoretical Foundations of
Distributed Systems
The theoretical foundations of distributed systems are a set of
concepts and principles that form the basis for the design and analysis
of these systems. Some of the key theoretical foundations of distributed
systems include:
No global time: In a distributed system, it is not
possible to rely on a single, global clock to coordinate events.
Instead, techniques such as logical clocks and vector clocks are used to
provide a partial ordering of events within the system (a minimal sketch
of a logical clock follows this list).
FLP theorem of asynchronous systems: The FLP
(Fischer, Lynch, and Paterson) theorem states that in an asynchronous
distributed system, no deterministic consensus protocol can guarantee
both safety and liveness (that is, the ability to make progress) if even
a single process may crash. This theorem highlights the challenges of
building distributed systems that can handle failures or delays in
communication.
Failure detection and timeout: One of the
challenges of distributed systems is detecting failures or delays in
communication and deciding how to respond to them. Timeout mechanisms
are often used to detect failures, but setting appropriate timeout
values can be difficult.
Concurrency and deadlocks: In a distributed system,
multiple processes may execute concurrently, which can lead to
situations where two or more processes are waiting for each other to
complete before they can make progress. This is known as a deadlock, and
can be difficult to resolve in a distributed system.
CAP theorem: The CAP (consistency, availability,
partition tolerance) theorem states that it is impossible for a distributed
system to simultaneously provide all three of the following properties:
consistency, availability, and partition tolerance. This theorem
highlights the trade-offs that must be made when designing a distributed
system.
End-to-end argument: The end-to-end argument states
that certain functions should be placed at the endpoints of a system,
rather than in the middle, in order to maximize flexibility and
modularity. This principle is often applied in distributed systems to
determine where certain functions should be implemented.
Consensus, leader selection, etc.: Consensus is the
process of agreeing on a single value or decision in a distributed
system. This can be challenging in the presence of failures or delays,
and requires mechanisms such as leader selection or voting to
resolve.
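To make the “no global time” point concrete, here is a minimal sketch of a Lamport logical clock in Go. The type and method names are illustrative, not from the course material: a counter is incremented before every local event and merged with the sender’s timestamp on receive, which yields a partial ordering of events across nodes.

```go
package main

import "fmt"

// LamportClock is a minimal logical clock (illustrative name).
type LamportClock struct {
	time uint64
}

// Tick is called before every local event and returns its timestamp.
func (c *LamportClock) Tick() uint64 {
	c.time++
	return c.time
}

// Receive merges the timestamp carried by an incoming message:
// the local clock jumps past the sender's clock if it is behind.
func (c *LamportClock) Receive(remote uint64) uint64 {
	if remote > c.time {
		c.time = remote
	}
	c.time++
	return c.time
}

func main() {
	var a, b LamportClock
	t1 := a.Tick()                // event on node A
	t2 := b.Receive(t1)           // message from A arrives at B
	fmt.Println(t1, t2, b.Tick()) // 1 2 3: send happens-before receive
}
```

A vector clock extends the same idea by keeping one counter per node, which additionally makes concurrent (causally unrelated) events detectable.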
1.2.4.8 Distributed Systems Design
Fields
The design of a distributed system involves addressing a number of
common problems and considering various architectural factors in order
to create a system that is scalable, reliable, and secure. Some key
considerations in distributed system design include:
Common problems: When designing a distributed
system, it is important to consider a number of common problems that can
impact the performance, reliability, and security of the system. These
include issues such as fail-over, maintenance, policies, and security
integration.
Information architecture: The information
architecture of a distributed system refers to the way in which
information is organized and structured within the system. This includes
defining and qualifying the various information fragments and flows that
make up the system.
Distribution architecture: The distribution
architecture of a distributed system refers to the layout and
organization of the various components of the system and how they are
connected to each other. This includes creating a map of all
participating systems and their quality of service, and determining how
communication and resource sharing will be managed within the
system.
1.2.5 Middleware
1.2.5.1 An Introduction to
Middleware
Middleware is software that sits between the operating system and the
application layer of a distributed system, providing a layer of
abstraction that enables communication and resource sharing among the
various components of the system. Some key characteristics of middleware
include:
It is used to facilitate the creation of distributed
applications.
Provides glue code and generators that allow different programming
languages and systems to interoperate.
Controls messages and enforces delivery guarantees, such as
at-least-once delivery.
Reorders requests from participants to create a causal or total
ordering.
Takes over responsibility for messages and may store them
temporarily.
Creates groups of nodes that process events together and controls
fail-over.
Hides differences in hardware, location of services, and offers load
balancing.
Allows filtering of requests or provides means to add additional
security information to calls.
Provides powerful services such as locking, scheduling, and
messaging to applications.
Offers frameworks that provide automatic storage, security checks,
and transactional control.
Supports message bus architectures that provide loose coupling
through publish/subscribe functions.
The importance of middleware in distributed systems cannot be
overstated. It is essential for enabling communication and resource
sharing among the various components of the system, and for abstracting
away the complexities of working with distributed systems. Without
middleware, it would be much more difficult to develop distributed
applications that are scalable, reliable, and secure.
1.2.5.2 Distribution
Transparencies
Distribution transparencies are features that are designed to hide
the complexities of working with distributed systems from the user or
developer. Some key distribution transparencies include:
Access: This transparency masks differences in
languages and data representation, allowing different systems to
communicate and exchange data with each other.
Failure: This transparency masks failures and
enables fault tolerance through automated fail-over to other
servers.
Scalability: This transparency provides intelligent
load balancing of requests to ensure that the system can handle a large
number of requests without becoming overloaded.
Redundancy: This transparency replicates data
behind the scenes to ensure that it is available even if one or more
servers fail.
Location: This transparency allows users to access
services using logical, rather than physical, names. This enables
services to be moved or relocated without affecting the user
experience.
Migration: This transparency hides the true
location of a service or object from clients. If the location changes,
the client will not notice.
Persistence: This transparency automatically loads
and stores data on demand to unload server memory.
Sharding: This transparency distributes storage
requests across multiple backend systems to ensure that the system can
scale to handle a large volume of data.
Transactions: This transparency makes requests ACID
(Atomic, Consistent, Isolated, and Durable), ensuring that they are
executed correctly and consistently.
Security: This transparency automatically checks
for the required credentials or roles in requests, ensuring that only
authorized users can access resources.
Monitoring: This transparency creates central logs
with correlation IDs that join request parts across nodes, enabling
administrators to track and monitor the performance of the system.
1.2.5.3 Classification of
Middleware
Middleware can be classified into several categories based on the
type of service it provides and the way it communicates with other
components in a distributed system. Some common types of middleware
include:
Socket-based services: These are middleware systems
that use sockets to communicate with other components in a distributed
system. Sockets are a low-level communication mechanism that allows
programs to send and receive data over a network.
Remote procedure calls (RPCs): These are middleware
systems that allow programs to make calls to procedures or functions
that are located on a remote machine, as if they were located on the
local machine.
Object request brokers (ORBs): These are middleware
systems that enable communication between objects that are running on
different machines. ORBs provide an interface that allows objects to
communicate with each other using a common protocol. Examples include
CORBA and RMI.
Message-oriented middleware (MOMs): These are
middleware systems that enable communication between components by
exchanging messages. MOMs are often used in event-driven systems and
reactive systems, which respond to external events or stimuli.
Web services: These are middleware systems that
provide a way for different applications to communicate over the web
using standard protocols such as XML-RPC, SOAP, and UDDI. Web services
are often used in service-oriented architectures (SOA) and
representational state transfer (REST) systems.
Frameworks: These are middleware systems that
provide a set of tools and libraries for building distributed
applications. Examples include Enterprise Java Beans (EJBs) and the Java
2 Enterprise Edition (J2EE).
Peer-to-peer (P2P) systems: These are middleware
systems that enable communication and resource sharing among a group of
computers or devices that are connected to each other. P2P systems do
not rely on a central server, but rather allow each device to
communicate directly with other devices in the network.
Agent-based systems: These are middleware systems
that use software agents to communicate with other components in a
distributed system. Agents are autonomous programs that are designed to
perform a specific task or function. Examples include Jini and
Aglets.
Tuple-spaces and distributed blackboards: These are
middleware systems that use a shared memory space to enable
communication and resource sharing among different components in a
distributed system.
Warehouse-computing architectures: These are
middleware systems that are designed to support the storage and
processing of large volumes of data in a distributed environment, such
as a data center.
1.3 Theoretical Models of
Distributed Systems
1.3.1 Overview
Message passing theoretical model
Distributed computing topologies
Client-server systems, including critical points, architectures,
processing and I/O models
1.3.2 Basic Principles
1.3.2.1 Synchronous
vs. Asynchronous Systems
Synchronous and asynchronous systems are two types of distributed
systems that differ in the way that they handle communication and the
passage of time.
In a synchronous system, events are assumed to be
delivered in a lockstep manner, with a known, bounded delay between the
occurrence of an event and its delivery. This means that events are
delivered within predetermined intervals, and the system can be designed
to operate on the assumption that events will be delivered within these
bounds.
Asynchronous systems do not have a fixed delay
between the occurrence of an event and its delivery. Events may be
delivered at any time, and the system must be able to handle this
uncertainty. Asynchronous systems typically require more complex
distributed algorithms to ensure correct operation, but they are
generally easier to build and maintain than synchronous systems.
In practice, most distributed systems are asynchronous, with
additional mechanisms such as failure detectors and randomization used
to help ensure correct operation.
1.3.2.2 Properties of Message
Protocols
Message protocol properties are characteristics that describe the
desired behavior of a message passing protocol in a distributed system.
These properties are used to ensure that the protocol operates correctly
and achieves its intended goals.
Some common message protocol properties include:
Correctness: This property refers to the invariant
properties of the protocol, which are properties that are expected to
hold throughout all possible executions of the protocol. Ensuring the
correctness of a protocol is important for ensuring that the protocol
achieves its intended goals.
Liveness/termination: This property refers to the
ability of the protocol to make progress in the context of certain
failures and within a bounded number of rounds. A protocol that
satisfies this property is said to be “lively” or “live”, while a
protocol that does not satisfy this property may deadlock and never
terminate.
Fairness: This property refers to the inability of
any participant in the protocol to be “starved” or denied access to
resources. A protocol that satisfies this property is said to be “fair”,
while a protocol that does not satisfy this property is said to be
“unfair”.
Agreement: This property refers to the ability of
all participants in the protocol to agree on a specific decision or
output value. Ensuring agreement is important for ensuring that the
protocol achieves a consistent result.
Validity: This property refers to the ability of
the protocol to output a result that is consistent with the input value.
A protocol that satisfies this property is said to be “valid”, while a
protocol that does not satisfy this property is said to be
“invalid”.
1.3.2.3 Complexity of Distributed
Algorithms
Time complexity refers to the amount of time it
takes for an algorithm to complete. This is often measured in terms of
the time of the last event before all processes finish.
Message complexity refers to the number of messages
that need to be sent in order for the algorithm to complete. This
includes both the number of messages sent and the size of the messages.
The number of rounds needed for termination is also an important factor
in the message complexity of an algorithm, as it can have a significant
impact on the overall scalability of the protocol.
1.3.2.4 Failure Types
In distributed systems, there are several types of failures that can
occur. These failures can have different impacts on the system and can
require different approaches to handling them.
Some common types of failures in distributed systems include:
Crash failure: This type of failure occurs when a
process stops working and remains down. This can be caused by a variety
of issues, such as hardware or software problems.
Connectivity failures: This type of failure occurs
when there is a problem with the network that connects the nodes in the
system. This can cause “split brain” situations, where the system
becomes divided into two separate networks, or node isolation, where a
node becomes disconnected from the rest of the system.
Message loss: This type of failure occurs when
individual messages are lost during transmission. This can be caused by
a variety of issues, such as network problems or hardware failures.
Byzantine failures: This type of failure occurs
when nodes in the system violate protocol assumptions and promises. This
can include breaking promises due to disk or configuration failures, or
intentionally behaving in a way that goes against the protocol.
Byzantine failures are often considered to be the most difficult type of
failure to handle in a distributed system, as they can be difficult to
detect and can have significant impacts on the system.
1.3.2.5 Distributed Computing
Topologies
There are several types of distributed computing topologies that can
be used to design distributed systems. These topologies can have
different characteristics and can be used to achieve different goals,
depending on the needs of the system.
Some common types of distributed computing topologies include:
Client/server systems: In this type of topology,
clients initiate communication with servers, which process the requests
and send a response back to the client. This is the most common type of
distributed system and is often used for applications where clients need
to request specific information or services from servers.
Hierarchical systems: In this type of topology,
every node can act as both a client and a server, but some nodes may
play a special role, such as a domain name system (DNS) server. This
type of topology can reduce communication overhead and provide options
for central control, making it useful in certain types of systems.
Totally distributed systems: In this type of
topology, every node is both a client and a server. This type of
topology can be useful for systems where nodes need to communicate with
each other directly, rather than relying on a central server.
Bus systems/pub-sub: In this type of topology,
every node listens for data and posts data in response. This can be
useful for event-driven systems where nodes need to communicate with
each other asynchronously.
The Kendall notation (originally introduced for Markovian queues) is a
standard shorthand for describing the behavior of a queuing system. It is
often used in the field of operations research to analyze the performance
of systems that have a finite number of servers and a finite queue size.
In the extended form used here, its positions denote:
β: Population size (limited or infinite)
M, D, G: Probability distribution for arrivals (Markovian/Poisson,
deterministic, or general)
N: Wait queue size (can be unlimited)
Q: Service policy type (FIFO, shortest remaining time
first, etc.)
M, D, G: Probability distribution for service time
m: Number of service channels
For example, M/M/1 denotes Poisson arrivals, exponentially distributed
service times, and a single server.
1.3.3 Queuing Theory
1.3.3.1 Generalized Queuing Theory
Terms (Henry Liu)
Server/Node: A combination of a wait queue and a
processing element
Initiator: The entity that initiates a service
request
Wait time: The time a request or initiator spends
waiting in line for service
Service time: The time it takes for the processing
element to complete a request
Arrival rate: The rate at which requests arrive for
service
Utilization: The percentage of time the processing
element spends servicing requests, as opposed to being idle
Queue length: The total number of requests waiting
and being serviced
Response time: The sum of the wait time and service
time for a single visit to the processing element
Residence time: The total time spent by the
processing element on a single transaction, if it is visited multiple
times
Throughput: The rate at which requests are
serviced, or how fast requests can be processed without long wait
times.
1.3.3.2 Little’s Law
Little’s Law states that in a stable system, the long-term average
number of customers (L) is equal to the long-term average effective
arrival rate (λ) multiplied by the average time a customer spends in the
system (W).
This can be expressed algebraically as L = λW.
Little’s Law is used to analyze and understand the behavior of
systems that involve waiting, such as queues or lines. It can help to
predict the average number of customers in a system, as well as the
average time they will spend waiting, given a certain arrival rate.
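As a hedged illustration (the numbers are made up), the following Go snippet applies Little’s Law to a single-server M/M/1 queue, where utilization is ρ = λ/μ and the average number in the system is L = ρ/(1 − ρ); Little’s Law then yields the average time in the system as W = L/λ.

```go
package main

import "fmt"

func main() {
	lambda := 80.0 // arrival rate: 80 requests/s (hypothetical)
	mu := 100.0    // service rate: 100 requests/s (hypothetical)

	rho := lambda / mu   // utilization of the server
	l := rho / (1 - rho) // average number in system (M/M/1 result)
	w := l / lambda      // Little's Law: W = L / lambda

	fmt.Printf("utilization=%.0f%%, L=%.1f requests, W=%.1f ms\n",
		rho*100, l, w*1000)
	// utilization=80%, L=4.0 requests, W=50.0 ms
}
```

Note how steeply L, and therefore W, grows as utilization approaches 100%, which is one of the central practical lessons of queuing theory.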
1.3.3.3 Heijunka
Heijunka is a Japanese term that refers to the practice of leveling
the production process by smoothing out fluctuations in demand and task
sizes. It is often used in lean manufacturing and just-in-time (JIT)
production systems to improve the efficiency and flow of work through a
system.
The goal of Heijunka is to create a steady, predictable flow of work
through the system by reducing variability in task sizes and demand.
This can be achieved through a variety of methods, such as:
Setting limits on the number of tasks or requests
that can be processed at any given time
Balancing the workload across different servers or
processing elements
Prioritizing tasks based on their importance or
impact on the overall system
Using techniques such as batching or grouping
similar tasks together to reduce variability
By leveling the production process and reducing variability in task
sizes, Heijunka can help to improve the efficiency and flow of work
through a system, and reduce the risk of bottlenecks or delays caused by
large differences in task size.
1.3.3.4 Lessons Learned from
Queuing Theory
Request numbers: Caching can be used to reduce the
number of requests that need to be processed by storing frequently
accessed data in memory, so that it can be quickly retrieved without the
need to fetch it from a slower storage medium.
Batching: The use of a multi-get API can help to
reduce the number of requests that need to be processed by allowing
multiple requests to be bundled together and processed as a single
unit (a sketch follows this list).
Task sizes and variability: Service level
agreements (SLAs) can be used to define the acceptable level of
variability in task sizes and completion times, and Heijunka is a
technique that involves leveling the production process by smoothing out
fluctuations in demand and task sizes. This can help to reduce
variability and improve the efficiency of the system.
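To illustrate the batching lesson above, here is a minimal Go sketch of a hypothetical multi-get API; the Store type and its methods are invented for the example. Bundling keys into one request replaces N round trips (and N queueing delays) with a single one.

```go
package main

import "fmt"

// Store is a hypothetical key/value backend (illustrative type).
type Store struct{ data map[string]string }

// Get fetches a single key: one (simulated) round trip per call.
func (s *Store) Get(key string) (string, bool) {
	v, ok := s.data[key]
	return v, ok
}

// MultiGet bundles many keys into one request, reducing the number
// of round trips from len(keys) to one.
func (s *Store) MultiGet(keys []string) map[string]string {
	out := make(map[string]string, len(keys))
	for _, k := range keys {
		if v, ok := s.data[k]; ok {
			out[k] = v
		}
	}
	return out
}

func main() {
	s := &Store{data: map[string]string{"a": "1", "b": "2"}}
	fmt.Println(s.MultiGet([]string{"a", "b", "c"})) // map[a:1 b:2]
}
```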
1.3.3.5 Request Problem in
Multi-Tier Networks
In a multi-tier network, the request problem refers to the fact that
requests must travel through multiple layers or tiers of
servers in order to be processed, and each layer adds its own
processing time and potential delays to the overall response time.
The average response time in a multi-tier network is therefore the
sum, over all layers, of the average number of trips to a layer
multiplied by the wait time at that layer (the time a request spends
waiting for a server to become available), plus the sum of the service
demand (the time it takes for a server to process a request) at each
layer.
It is important to note that in a multi-tier network, all requests
are synchronous and may be in contention with each other, which means
that wait times can occur due to contention for server resources. This
can impact the overall efficiency of the system and may require the use
of techniques such as caching or batching to reduce the number of
requests that need to be processed.
1.3.3.6 Task Size Problem in
Multi-Tier Networks
In a multi-tier network, the task size problem refers to the fact
that differences in task size can cause delays and inefficiencies in the
processing of requests.
In the case of pipeline stalls between nodes, large differences in
task size can cause requests to be held up at one node while waiting for
the next node to become available, leading to delays in the overall
response time.
1.3.3.7 Theories & Model
vs. Reality
When applying queuing theory models to real-world systems, there are
several factors that can impact the accuracy and usefulness of the
model. These include:
Latency: Latency refers to the time it takes for a
request to travel from one server to another or for a task to be
completed. Latency can vary based on a variety of factors, such as
network speed, server load, and the distance between servers, and it can
impact the accuracy of queuing theory models.
Blocking/locking/serialization in service units: In
real-world systems, servers may block or lock requests while they are
being processed, or may process requests serially rather than in
parallel. This can impact the accuracy of queuing theory models that
assume parallel processing.
Non-random distributions and feedback effects: In
real-world systems, request and task arrival rates may not always follow
a random distribution, and there may be feedback effects that impact the
flow of work through the system. This can make it difficult to
accurately model the behavior of the system using queuing theory.
Dead requests: In some cases, requests may be lost
or dropped due to errors or other issues, which can impact the accuracy
of queuing theory models.
Backpressure: In systems where the capacity of
servers or processing elements is limited, it is possible for requests
to build up and create a backlog of work. This is known as backpressure,
and it can impact the accuracy of queuing theory models.
Missing variables and coherence losses: Queuing
theory models may not always include all the relevant variables or may
not accurately account for coherence losses, which can impact their
accuracy in predicting the behavior of real-world systems.
1.3.4 Client/Server
Architectures
1.3.4.1 Critical Points in
Client/Server Systems
On the client side:
Locating the server: The client must be able to
locate the server in order to establish a connection. This process can
be impacted by factors such as network speed and latency.
Authentication: The client may need to authenticate
itself in order to access the server and its resources.
Sync/async: The client may be able to send requests
synchronously or asynchronously, which can impact the overall
performance of the system.
Speed up/down: The client may be able to adjust the
speed at which it sends requests in order to optimize the performance of
the system.
Load balancing: The client may need to use load
balancing techniques to distribute requests evenly across multiple
servers or processing elements.
Queues: The client may need to use queues to manage
requests and ensure that they are processed in an orderly manner.
On the server side:
Many clients: The server may need to handle
requests from many clients simultaneously, which can impact its
performance and efficiency.
Session state: The server may need to maintain
session state information in order to track the progress of individual
requests.
Authentication: The server may need to authenticate
clients.
Authorization: The server may need to authorize
clients to access specific resources or perform certain actions.
Privacy: The server may need to ensure that client
data is kept private and secure.
Sync/async: The server may be able to process
requests synchronously or asynchronously, which can impact the overall
performance of the system.
Blocking: The server may block requests while they
are being processed, which can impact the performance of the
system.
Single/multicore CPU intensive: The server may be
able to use multiple cores or processors to process requests in
parallel, or it may be limited to processing requests serially on a
single core.
Queues: The server may need to use queues to manage
requests and ensure that they are processed in an orderly manner.
Between both: Bandwidth/latency: The bandwidth and
latency of the network connection between the client and the server can
impact the performance and efficiency of the system.
1.3.4.2 Stateful vs Stateless
Systems (The Stateful Server Problem)
The stateful server problem refers to the trade-offs that must be
considered when designing and implementing a server-based system that
maintains state information.
On the one hand, stateful servers have several
advantages:
Data locality: Stateful servers can store data
locally, which can improve the performance of the system by reducing the
need to fetch data from external storage.
Consistency: Stateful servers can ensure that data
is consistent and up-to-date, which can be important in certain
applications.
However, stateful servers also have some
disadvantages:
Availability: Stateful servers may be less
available than stateless servers, as they may be more vulnerable to
failures or downtime.
Load balancing: Stateful servers may be more
difficult to load balance than stateless servers, as they must maintain
state information for individual clients.
Stateless design, on the other hand, stores all data in external
storage such as databases or caches, which can make it easier to design
and implement the system. However, in the event of failures, programming
stateless systems can be more difficult, as all data must be retrieved
from external storage.
1.3.4.3 Terminology for
Client/Server Systems
Host: A physical machine with n CPUs.
Server: A process running on a host that receives
messages, performs computations, and sends messages (not necessarily
responses).
Thread: An independent computation context within a
process, which can be pre-empted by the kernel (kernel-thread) or yield
voluntarily (application-level scheduling).
Multi-threading: The use of multiple threads within
a single process context. This can be achieved using kernel-level
threading (where the kernel switches between threads) or using multiple
kernel threads running in parallel on a multi-core system.
Multi-channel: A thread that is able to watch
multiple channels using a single system call, such as a
select() call.
Synchronous processing: A caller calls a function
and waits for its results, doing nothing while waiting.
Asynchronous processing: A caller calls a function
and immediately continues executing its own code. The called function is
eventually executed, and a callback function is called to inform the
caller about the completion.
Parallel processing: The deterministic execution of
independent code paths.
Blocking: A thread calls a function that needs time
to complete, such as fetching a resource from disk. The thread cannot
continue and blocks an execution core while waiting for the result. The
thread is “context switched” and a new code path is loaded and executed
by the core.
Non-blocking calls: A caller calls the non-blocking
version of a function. If the function can perform its job immediately
without delaying the caller, it will do so. If the function would need
time to perform its job, it returns immediately and informs the caller
that it would have blocked. The caller can then decide to do something
else and try again later (poll again); a minimal sketch follows this
list.
Synchronization: The control of access to shared
data by multiple threads in order to prevent data inconsistencies.
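Go’s select statement with a default case mirrors the non-blocking call described above; this is a minimal sketch, not course code: if the receive would block, the default branch runs and the caller is free to do other work and poll again later.

```go
package main

import "fmt"

func main() {
	results := make(chan string, 1)

	// Non-blocking receive: if no result is ready, the default
	// branch runs instead of blocking the caller.
	select {
	case r := <-results:
		fmt.Println("got result:", r)
	default:
		fmt.Println("would block; doing other work, polling later")
	}

	results <- "done"

	// Blocking receive for comparison: waits until a result arrives.
	fmt.Println("got result:", <-results)
}
```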
1.3.4.4 Overarching Client/Server
Architectures
Multi-Tier System: This type of architecture
involves splitting up the system into multiple layers or tiers, each of
which performs a specific set of functions. The tiers may include a
presentation layer, a business logic layer, and a data storage layer,
among others.
Large fan-out Architectures: This type of
architecture involves a central component that receives requests from
many clients and distributes them to multiple servers or other
resources. This can allow the system to scale more easily and handle a
large volume of requests.
Pipeline (SEDA): This type of architecture involves
breaking up the processing of a request into multiple stages, each of
which is handled by a separate component. The stages are connected in a
pipeline, with each stage processing the request and passing it on to
the next stage.
Offline Processing: This type of architecture
involves processing requests and tasks in an offline or deferred manner,
rather than in real-time. This can be useful in situations where the
workload is too large to handle in real-time, or where real-time
processing is not required.
1.3.5 Process and I/O Models
1.3.5.1 Different Process
Models
Single Thread/Single Core: This type of process
model involves a single thread of execution running on a single core.
This can be efficient for certain types of workloads, but may not be
able to take full advantage of multiple cores or processors.
Multi-Thread/Single Core: This type of process
model involves multiple threads of execution running on a single core.
This can allow the system to perform multiple tasks concurrently, but
may not be able to fully utilize the processing power of multiple cores
or processors.
Multi-Thread/Multi-Core: This type of process model
involves multiple threads of execution running on multiple cores or
processors. This can allow the system to fully utilize the processing
power of multiple cores or processors, and can be more efficient for
certain types of workloads.
Single Thread/Multi-Process: This type of process
model involves a single thread of execution running within each of
multiple processes. This can allow the system to take advantage of
multiple cores or processors, but may be less efficient than other
models for certain types of workloads.
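A minimal Go sketch of the multi-thread/multi-core model: a pool of worker goroutines, sized to the number of available cores, drains a shared task queue. The workload and pool shape are illustrative assumptions, not prescribed by the notes.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	workers := runtime.NumCPU() // one worker per available core
	tasks := make(chan int)
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() { // each goroutine may run on its own core
			defer wg.Done()
			for t := range tasks {
				_ = t * t // stand-in for CPU-bound work
			}
		}()
	}

	for i := 0; i < 1000; i++ {
		tasks <- i
	}
	close(tasks)
	wg.Wait()
	fmt.Println("processed 1000 tasks on", workers, "workers")
}
```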
1.3.5.2 Questions for Process
Models
Can it use available cores/CPUs?
What is the ideal number of threads?
How does it deal with delays/(b)locking?
How does it deal with slow requests/uploads?
Is there observable non-determinism aka race conditions?
Is locking/synchronization needed?
What is the overhead of context switches and memory?
1.3.5.3 Amdahl’s Law
According to Amdahl’s Law, the maximum improvement in overall system
performance that can be achieved by improving a particular part of the
system is limited by the fraction of time that the improved part of the
system is used. In other words, if only a small portion of the system’s
workload is affected by the improvement, the overall improvement in
performance will also be small.
For example, if a particular part of a system is improved so that it
runs twice as fast, but that part of the system is only used 10% of the
time, the overall speedup is limited to about 5%
(1 / (0.9 + 0.1/2) ≈ 1.05). On the other hand, if the improved part of
the system is used 50% of the time, the overall speedup is much larger,
at about 33% (1 / (0.5 + 0.5/2) ≈ 1.33).
Amdahl’s Law is often used to understand the potential benefits and
limitations of optimizing or improving specific parts of a system. It
can be a useful tool for determining how much resources should be
invested in improving a particular part of the system, and for
understanding the potential impact of those improvements on overall
system performance.
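A small sketch of the calculation, assuming the usual formulation of the law: if a fraction p of the workload is sped up by a factor s, the overall speedup is 1 / ((1 − p) + p/s).

```go
package main

import "fmt"

// amdahl returns the overall speedup when a fraction p of the
// workload is accelerated by a factor s.
func amdahl(p, s float64) float64 {
	return 1 / ((1 - p) + p/s)
}

func main() {
	fmt.Printf("p=10%%, s=2: %.2fx\n", amdahl(0.1, 2))      // ~1.05x
	fmt.Printf("p=50%%, s=2: %.2fx\n", amdahl(0.5, 2))      // ~1.33x
	fmt.Printf("p=95%%, huge s: %.1fx\n", amdahl(0.95, 1e9)) // caps at 20x
}
```

The last line shows the hard ceiling: even a near-infinite speedup of 95% of the workload can never make the whole system more than 20 times faster.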
1.3.5.4 Different I/O Models
Synchronous Blocking (Java before NIO/AIO): Prior
to the introduction of the Java New I/O (NIO) and Asynchronous I/O (AIO)
APIs, Java had a different model for handling input/output (I/O)
operations. This model involved using threads to block and wait for I/O
operations to complete, which could be inefficient and consume a lot of
system resources.
Synchronous Non-Blocking (Polling pattern): The
polling pattern is a way of handling I/O operations in which a central
component periodically checks for the completion of I/O operations. This
can be done by repeatedly calling a function that checks the status of
the operation, or by using a timer to trigger the check at regular
intervals.
Asynchronous Blocking (Reactor pattern): The
Reactor pattern is a way of handling I/O operations in which a central
component is notified when an I/O operation is completed, rather than
periodically checking for its completion. This can be more efficient
than the polling pattern, as it allows the system to respond to I/O
operations as soon as they are completed, rather than waiting for a
periodic check.
Asynchronous Non-Blocking (Proactor pattern): The
Proactor pattern is similar to the Reactor pattern, but it is designed
to handle high-concurrency environments where many I/O operations are
occurring simultaneously. It uses a combination of asynchronous I/O and
event-driven design to allow for efficient handling of multiple I/O
operations at once.
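To make the Reactor idea concrete in the document’s terms, here is a hedged Go sketch: a single event-loop goroutine is notified as events complete on several sources and dispatches each one to a handler, instead of periodically polling every source. The channel names and handlers are invented for the example.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	network := make(chan string)
	disk := make(chan string)

	// Event sources complete asynchronously.
	go func() { time.Sleep(10 * time.Millisecond); network <- "packet" }()
	go func() { time.Sleep(20 * time.Millisecond); disk <- "block" }()

	// Reactor-style loop: one goroutine is notified as events
	// complete and dispatches them to handlers, instead of
	// periodically polling each source.
	for handled := 0; handled < 2; handled++ {
		select {
		case ev := <-network:
			fmt.Println("network handler:", ev)
		case ev := <-disk:
			fmt.Println("disk handler:", ev)
		}
	}
}
```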
1.3.5.5 Questions for I/O
Models
Can it deal with all kinds of input/output?
How are synchronous channels integrated?
How hard is programming?
Can it be combined with multi-cores?
Scalability through multi-processes?
Race conditions possible?
1.4 Message Protocols
1.4.1 Overview
Message protocols
Delivery guarantees in point-to-point communication
Reliable broadcast
Request ordering and causality
Programming client-server systems using sockets
1.4.2 Delivery Guarantees
1.4.2.1 The Role of Delivery
Guarantees
Shop order: The scenario described involves an
online shop in which orders are placed and processed. The goal is to
ensure that orders are delivered correctly and efficiently, regardless
of any potential issues that may arise.
TCP Communication properties: TCP (Transmission
Control Protocol) is a networking protocol that is used to establish and
maintain communication between devices over a network. It has several
key properties that are relevant to the scenario described, including
reliability, flow control, and congestion control.
At-least-once: The “at-least-once” delivery
guarantee means that a message may be delivered more than once, but it
will always be delivered at least once. This can be useful in situations
where it is more important to ensure that a message is delivered, even
if it may be duplicated, than it is to prevent duplicates from
occurring.
At-most-once: The “at-most-once” delivery guarantee
means that a message will be delivered at most once. This can be useful
in situations where it is more important to prevent duplicates from
occurring than it is to ensure that a message is always delivered.
Exactly once: The “exactly once” delivery guarantee
means that a message will be delivered exactly once, with no duplicates.
This can be more difficult to achieve than the other delivery
guarantees, as it requires additional complexity and overhead to ensure
that duplicates are prevented.
Message complexity: The number of messages sent
refers to the total number of messages that are transmitted as part of
the communication process. In the scenario described, the number of
messages sent may affect the efficiency and reliability of the
communication process, and may need to be taken into account when
determining the appropriate delivery guarantee to use.
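A common way to approximate “exactly once” on top of an at-least-once channel is to attach a unique ID to every message and have the receiver discard duplicates. The sketch below illustrates that idea; the types and field names are invented, and a production version would also need to persist and eventually expire the set of seen IDs.

```go
package main

import "fmt"

// Message carries a unique ID so retransmissions can be detected.
type Message struct {
	ID      string
	Payload string
}

// Receiver turns at-least-once delivery into effectively-once
// processing by remembering which IDs it has already seen.
type Receiver struct {
	seen map[string]bool
}

func (r *Receiver) Handle(m Message) {
	if r.seen[m.ID] {
		return // duplicate from a retransmission: drop it
	}
	r.seen[m.ID] = true
	fmt.Println("processing:", m.Payload)
}

func main() {
	r := &Receiver{seen: map[string]bool{}}
	m := Message{ID: "order-42", Payload: "ship 1 book"}
	r.Handle(m)
	r.Handle(m) // retransmitted after a timeout: ignored
}
```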
1.4.2.2 Why is TCP not Enough?
While TCP (Transmission Control Protocol) is a widely used networking
protocol that provides a reliable communication channel between devices,
it is not always sufficient on its own to ensure proper behavior in all
situations. Here are some reasons why TCP may not be enough:
Lost messages retransmitted: TCP includes
mechanisms for retransmitting lost messages, which can help to improve
the reliability of communication. However, if messages are frequently
lost or the network is particularly unreliable, the overhead of
retransmitting lost messages may become a burden on the system.
Re-sequencing of out of order messages: TCP
includes mechanisms for reordering out-of-order messages, which can help
to ensure that messages are delivered in the correct order. However, if
messages are frequently delivered out of order, this can be inefficient
and may cause issues with the overall communication process.
Sender choke back (flow control): TCP includes flow
control mechanisms that allow the sender to adjust its rate of
transmission based on the capacity of the receiver. However, if the
sender is sending messages too quickly, this can lead to congestion on
the network and reduced performance.
No message boundary protection: TCP does not
provide any protection for message boundaries, which means that messages
may be broken up or combined during transmission. This can make it
difficult to ensure that messages are delivered in their entirety and
can cause issues with the overall communication process.
Timeout problem: TCP includes mechanisms for
detecting and handling connection failures, but these mechanisms may not
be sufficient in all situations. For example, if a connection is lost
due to a timeout, it may take some time to detect and recover from the
failure, which can lead to delays and disruptions in the communication
process.
To address these and other issues, it may be necessary to use
additional protocols or techniques, such as message-oriented middleware
or application-level protocols, to ensure proper behavior in case of
connection failures and to provide additional features and guarantees
for message delivery.
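One common application-level fix for the missing message boundaries is
length-prefixed framing. A minimal sketch in Java (the 4-byte length
header is an assumed convention, not part of TCP itself):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Length-prefixed framing: restores message boundaries on top of
// TCP's byte stream by writing a 4-byte length before each payload.
public final class Framing {
    public static void writeMessage(DataOutputStream out, String msg)
            throws IOException {
        byte[] payload = msg.getBytes(StandardCharsets.UTF_8);
        out.writeInt(payload.length); // frame header: payload size
        out.write(payload);
        out.flush();
    }

    public static String readMessage(DataInputStream in) throws IOException {
        int length = in.readInt();  // read the frame header
        byte[] payload = new byte[length];
        in.readFully(payload);      // block until the whole frame arrived
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```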
1.4.2.3 Different Levels of
Timeouts
Business-Process-Timeout: This timeout is set at
the business process level and is used to ensure that a business process
does not get stuck or take too long to complete. This timeout may be
triggered if a particular task or operation within the process takes
longer than expected to complete, or if the process as a whole takes too
long to finish.
RPC-Timeout (order progress): This timeout is set
at the level of remote procedure calls (RPCs) and is used to ensure that
RPCs do not get stuck or take too long to complete. This timeout may be
triggered if an RPC takes longer than expected to complete, or if the
progress of an RPC is not being monitored properly.
TCP-Timeout (reliable channel): This timeout is set
at the TCP (Transmission Control Protocol) level and is used to ensure
that the reliable communication channel provided by TCP is functioning
properly. This timeout may be triggered if a connection is lost or if
the channel becomes congested or otherwise unstable.
1.4.2.4 Delivery Guarantees for
RPCs
Best effort: The “best effort” delivery guarantee
means that no specific guarantees are made about the delivery of
requests or responses. This means that requests may be lost or responses
may not be received, and there is no mechanism in place to ensure that
this does not happen.
At least once: The “at least once” delivery
guarantee means that a request may be delivered more than once, but it
will always be delivered at least once. This can be useful in situations
where it is more important to ensure that a request is delivered, even
if it may be duplicated, than it is to prevent duplicates from
occurring.
At most once: The “at most once” delivery guarantee
means that a request will be delivered at most once. This can be useful
in situations where it is more important to prevent duplicates from
occurring than it is to ensure that a request is always delivered.
Once and only once/exactly once: The “once and only
once” or “exactly once” delivery guarantee means that a request will be
delivered exactly once, with no duplicates. This can be more difficult
to achieve than the other delivery guarantees, as it requires additional
complexity and overhead to ensure that duplicates are prevented.
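A hedged sketch of what an "at least once" client might look like: the
request is retransmitted on every timeout until an answer arrives. The
Transport interface and the timeout values here are hypothetical, not
part of any particular RPC system:

```java
import java.io.IOException;
import java.time.Duration;

// At-least-once on the client side: keep retransmitting until an
// answer arrives. The server may therefore see the same request more
// than once, which is only safe if it is idempotent or deduplicated.
public class AtLeastOnceClient {
    interface Transport { // hypothetical transport abstraction
        String call(String request, Duration timeout) throws IOException;
    }

    static String callAtLeastOnce(Transport t, String request, int maxRetries)
            throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return t.call(request, Duration.ofSeconds(2)); // RPC timeout
            } catch (IOException e) {
                last = e; // timeout or connection loss: retry the request
            }
        }
        throw last; // give up: the business-process timeout takes over
    }
}
```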
1.4.3 Idempotency
1.4.3.1 Overview
Idempotency is a property of operations or requests that ensures that
they can be safely repeated without changing the result of the
operation. In other words, if an operation is idempotent, it will have
the same result whether it is performed once or multiple times.
The first request needs to be idempotent: In a sequence of requests,
the first request may be retransmitted before any acknowledgement
arrives, so it must be safe to repeat without changing the overall
result.
The last message can only be best effort: The final message of an
exchange (for example the last acknowledgement) cannot itself be
acknowledged, so its delivery can only be best effort.
Messages may be reordered: Retransmissions and network delays can
cause messages to arrive out of order, so the protocol must tolerate
reordering.
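A small illustration of the difference (the Counter class is
hypothetical): set() can be repeated safely, increment() cannot:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// set() is idempotent: repeating it leaves the same result.
// increment() is not: every duplicate delivery changes the outcome.
public class Counter {
    private final Map<String, Integer> store = new ConcurrentHashMap<>();

    public void set(String key, int value) {  // safe to repeat
        store.put(key, value);
    }

    public void increment(String key) {       // NOT safe to repeat
        store.merge(key, 1, Integer::sum);
    }
}
```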
1.4.3.2 Server State and
Idempotency
Idempotency is an important property to consider when designing
operations or requests that may be repeated or delivered multiple times,
as it can help to ensure that the operation or request is able to be
safely repeated without affecting the overall result. Here are some
additional considerations related to idempotency and server state:
No need to remember a request and its result: If an
operation or request is idempotent, the server does not need to remember
the request or its result. This can be useful in situations where the
server’s storage is limited or unreliable, as it means that the server
does not need to maintain a record of all previous requests and their
results.
Server can lose its storage: If the server’s
storage is lost or becomes unavailable, it should not affect the overall
result of the operation or request, as long as the operation or request
is idempotent. This can help to ensure that the operation or request is
able to be safely repeated even if the server’s storage is lost.
Concurrent updates might be consistent without concurrency
control: If an operation or request is idempotent, it is
possible that concurrent updates to the same data may be consistent
without the need for concurrency control mechanisms such as locks or
transactions. This can help to improve the efficiency and performance of
the system.
1.4.3.3 Implementing Delivery
Guarantees for Idempotent Requests
“At least once” implementation for idempotent
requests: For idempotent requests, the “at least once” delivery
guarantee can be implemented by simply sending an acknowledgement (ack)
to the client after the request has been received. This approach does
not require any updates to the server state, and is suitable for
requests that do not have any critical side effects.
“At most once” implementation for non-idempotent
requests: For non-idempotent requests, the “at most once” delivery
guarantee can be implemented by storing a response on the server until
the client confirms that it has been received. This approach requires
the server to maintain state for each response, and may involve adding a
request number to each request to help the server detect and discard
duplicate requests.
“Exactly once” implementation: The “exactly once”
delivery guarantee is not possible to achieve in asynchronous systems
with network failures. However, it can be approximated using techniques
such as two-phase commit and epoch numbers, which allow the client and
server to coordinate their actions and ensure that they do not forget
their decisions. This approach may involve maintaining an atomic log on
both the client and server, and storing responses on the server until
they are confirmed by the client.
1.4.3.4 Repeating Non-Idempotent
Operations
If an operation is not idempotent, it means that it cannot be safely
repeated multiple times and is likely to have unintended side effects.
In this case, there are several measures that can be taken to ensure
reliable communication:
Use a message ID to filter for duplicate sends: By
including a unique message ID in each request, the server can filter out
duplicates and only execute the request once.
Keep a history list of request execution results on
the server: If the reply to a request is lost, the server can retransmit
the result from its history list. This helps to ensure that the client
receives the correct result even if the initial reply was lost.
Lease resources on the server: In some cases, it
may be necessary to keep state on the server in order to facilitate
communication. For example, a client may “lease” resources on the server
for a specific period of time. This can help to ensure that resources
are used efficiently and released when they are no longer needed.
It is important to carefully manage state on the server in order to
avoid overloading the system with old replies. It may be necessary to
set limits on how long to store old replies and when to discard
them.
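A sketch of the message-ID and history-list idea (class and method
names are hypothetical; a real server would also bound and expire the
history, as noted above):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Duplicate filtering for non-idempotent requests: the server stores
// each message ID together with its result, so a retransmitted
// request is answered from the history list instead of being
// executed a second time.
public class DedupServer {
    private final Map<String, String> history = new ConcurrentHashMap<>();

    public String handle(String messageId, String request) {
        // computeIfAbsent executes the request only on first delivery;
        // duplicates get the stored result back
        return history.computeIfAbsent(messageId, id -> execute(request));
    }

    private String execute(String request) {
        return "result-of-" + request; // placeholder for the real operation
    }
}
```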
1.4.4 Order
1.4.4.1 Request Order in
Multi-Point-Protocols
No request order from one sender: In a multi-point
protocol, there is no guaranteed order for requests sent by a single
sender. This means that requests may be received and processed in a
different order than they were sent, and the sender should be prepared
to handle this possibility.
No request order between different senders: In a
multi-point protocol, there is no guaranteed order for requests sent by
different senders. This means that requests from different senders may
be received and processed in a different order than they were sent, and
the senders should be prepared to handle this possibility.
No request order between independent requests of different
senders: In a multi-point protocol, there is no guaranteed
order for independent requests sent by different senders. This means
that independent requests from different senders may be received and
processed in a different order than they were sent, and the senders
should be prepared to handle this possibility.
1.4.4.2 Request Ordering with
Multiple Nodes
In a multi-node system, it may be necessary to use a reliable
broadcast protocol to ensure that requests are processed in the desired
order. Here are some examples of protocols that can be used for request
ordering with multiple nodes:
Reliable Broadcast: Reliable broadcast is a
protocol that ensures that a message is delivered either to all
correct nodes in the system or to none of them. By itself it
guarantees delivery, not ordering; the ordering guarantees below are
built on top of it.
FIFO Cast: FIFO cast is a protocol that ensures
that messages are delivered in the order in which they were sent, with
the first message sent being the first one to be delivered. This can
help to ensure that requests are processed in the correct order, even if
they are sent from different nodes or if there are delays or other
issues with the network. This can still lead to causal
inconsistencies!
Causal Cast: Causal cast is a protocol that ensures
that messages are delivered in a causally consistent order, based on the
dependencies between the messages. This can help to ensure that requests
are processed in the correct order, even if they are sent from different
nodes or if there are delays or other issues with the network.
Absolutely Ordered Casts: Absolutely ordered casts
is a protocol that ensures that messages are delivered in a totally
ordered sequence, with no uncertainty about the order in which the
messages were sent. This can help to ensure that requests are processed
in the correct order, even if they are sent from different nodes or if
there are delays or other issues with the network.
1.4.4.3 Implementing Causal Ordered
Broadcasts
Piggybacking previous messages: One solution for
implementing causal ordered broadcasts is to piggyback every message
sent with the previous messages. This means that when a message is sent,
it is accompanied by the previous messages that it depends on. This can
help to ensure that processes that may have missed a message can learn
about it with the next incoming message and then deliver it
correctly.
Sending event history with every message: Another
solution for implementing causal ordered broadcasts is to send the event
history with every message. This can be done using techniques such as
vector clocks, which are used to track the dependencies between events
in a distributed system. With this approach, messages are not delivered
until the order is correct. This can help to ensure that messages are
delivered in the correct order, even if there are delays or other issues
with the network.
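As a sketch of the vector-clock delivery test used in causal
broadcast (a common formulation: a message from sender s with stamp
mc is deliverable when mc[s] equals the local clock entry for s plus
one and no other entry of mc exceeds the local clock; the class
itself is illustrative):

```java
import java.util.Arrays;

// Causal delivery check with vector clocks.
public class CausalDelivery {
    static boolean deliverable(int[] mc, int[] lc, int sender) {
        // next message from this sender?
        if (mc[sender] != lc[sender] + 1) return false;
        // all causal predecessors from other senders already delivered?
        for (int k = 0; k < mc.length; k++) {
            if (k != sender && mc[k] > lc[k]) return false;
        }
        return true;
    }

    static void deliver(int[] lc, int sender) {
        lc[sender]++; // record that the next message from sender arrived
    }

    public static void main(String[] args) {
        int[] local = {1, 0, 0};  // one message from p0 delivered so far
        int[] stamp = {1, 1, 0};  // p1's first message, depends on p0's
        System.out.println(deliverable(stamp, local, 1)); // true
        deliver(local, 1);
        System.out.println(Arrays.toString(local)); // [1, 1, 0]
    }
}
```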
1.4.4.4 Implementing Absolutely
Ordered Casts
All nodes send messages to every other node: One
solution for implementing atomic broadcasts is for all nodes to send
their messages to every other node in the system. This ensures that all
nodes have a complete set of messages, which can be used to determine
the total order of the messages.
All nodes receive messages, but wait with delivery:
After receiving all of the messages, all nodes can wait with delivery
until the total order of the messages has been determined.
One node is selected to organize the total order:
To determine the total order of the messages, one node can be selected
to organize the messages into a total order. This node can use a variety
of techniques, such as vector clocks or Lamport timestamps, to determine
the order of the messages.
The node sends the total order to all nodes: Once
the total order has been determined, the node can send the total order
to all other nodes in the system.
All nodes receive the total order and deliver their
messages: Finally, all nodes can receive the total order and
deliver their messages according to the determined order.
There are, however, disadvantages to these implementations:
May have high overhead: This solution may have high
overhead, as it requires all nodes to send and receive messages from
every other node in the system. This can be particularly problematic in
large systems with many nodes, as it may result in many messages being
transmitted and processed.
May have high latency: This solution may also have
high latency, as it requires all nodes to wait for the total order to be
determined before delivering their messages. This can be particularly
problematic in systems where low latency is critical.
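One practical variant of this scheme is a sequencer: the selected
node stamps every message with a global sequence number, and each
receiver delivers strictly in sequence order. A rough sketch (all
names are hypothetical, and the network transport is omitted):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

// Sequencer-based total order: one distinguished node stamps every
// message with a global sequence number; every node buffers incoming
// messages and delivers them in sequence order, so all nodes see the
// same total order.
public class SequencedReceiver {
    static final AtomicLong sequencer = new AtomicLong(0); // on one node

    private long expected = 0;                  // next seq to deliver
    private final Map<Long, String> pending = new TreeMap<>();

    // called when a (seq, msg) pair arrives, possibly out of order
    void receive(long seq, String msg) {
        pending.put(seq, msg);
        while (pending.containsKey(expected)) { // deliver gap-free prefix
            deliver(pending.remove(expected));
            expected++;
        }
    }

    void deliver(String msg) {
        System.out.println("delivered: " + msg);
    }

    public static void main(String[] args) {
        SequencedReceiver r = new SequencedReceiver();
        r.receive(sequencer.getAndIncrement(), "a"); // seq 0: delivered
        long s1 = sequencer.getAndIncrement();       // seq 1
        long s2 = sequencer.getAndIncrement();       // seq 2
        r.receive(s2, "c");                          // buffered: gap at 1
        r.receive(s1, "b");                          // delivers b, then c
    }
}
```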
1.4.5 Sockets
1.4.5.1 Properties of Sockets
Sockets are a programming interface that enables communication
between networked computers. They are used to send and receive data over
a network connection using either TCP (Transmission Control Protocol) or
UDP (User Datagram Protocol) connections.
Socket properties include:
Using either TCP or UDP connections: Sockets can
use either TCP or UDP connections to communicate with other computers
over a network. TCP provides a reliable, connection-oriented
communication channel, while UDP provides a connectionless, unreliable
communication channel.
Serving as a programming interface: Sockets provide
a programming interface that enables applications to send and receive
data over a network connection. They are typically used in conjunction
with other programming constructs, such as threads or event loops, to
enable concurrent communication and data processing.
A specification of “Host”, “Port”, “Connection
type”: Sockets are identified by a combination of a host, port,
and connection type. The host specifies the address of the computer on
which the socket is running, the port specifies a specific communication
endpoint on the host, and the connection type specifies whether the
socket is using TCP or UDP.
A unique address of a channel endpoint: Each socket
is assigned a unique address that identifies it as a communication
endpoint. This address is used to identify the socket when sending or
receiving data over the network.
1.4.5.2 Berkeley Sockets
Berkeley sockets provide a set of primitives or functions that can be
used to create, manage, and manipulate network connections and
communication channels.
Here is a brief description of the meaning of each of the Berkeley
Sockets primitives:
Socket: The socket primitive creates a new
communication endpoint, or socket, that can be used to send and receive
data over a network connection.
Bind: The bind primitive attaches a local address
to a socket, specifying the address and port on which the socket will
listen for incoming connections or data.
Listen: The listen primitive announces the
willingness of a socket to accept incoming connections from other
hosts.
Accept: The accept primitive blocks the caller
until a connection request arrives at the socket. When a connection
request is received, it creates a new socket for the connection and
returns a reference to the new socket.
Connect: The connect primitive actively attempts to
establish a connection to a remote host by sending a connection request
to the remote host.
Send: The send primitive sends some data over the
connection associated with a socket.
Receive: The receive primitive receives some data
over the connection associated with a socket.
Close: The close primitive releases the connection
associated with a socket, allowing it to be used for other
purposes.
1.4.5.3 Server Side Processing
using Processes
Server-side processing using processes is a model for handling
incoming requests in a networked system. It involves the following
steps:
Server: The server listens on a specified port, waiting for incoming
connection requests.
Client: A client connects to the server on an arbitrary port and
establishes a connection between the client and the server.
Server: The server accepts the connection request and spawns a new
process to handle it. The new process inherits the connected socket,
which shares the server's port but is distinguished by the client's
address, and the server goes back to listening for new connection
requests.
This model allows the server to scale to some degree, as it can
handle multiple requests concurrently by spawning new processes to
handle each request. However, process creation can be expensive, and
there may be limits on the number of processes that can be created on a
given system. An example of this model is traditional CGI processing in
a web server, where a new process is spawned to handle each incoming
request.
1.4.5.4 Server Side Processing
using Threads
Server-side processing using threads is a model for handling incoming
requests in a networked system that is similar to the process-based
model, but uses threads instead of processes to handle requests. It
involves the following steps:
Server: The server listens on a specified port, waiting for incoming
connection requests.
Client: A client connects to the server on an arbitrary port and
establishes a connection between the client and the server.
Server: The server accepts the connection request and spawns a new
thread to handle the request. The thread works on the newly accepted
connection socket, which shares the server's port but is distinguished
by the client's address, and the server goes back to listening for new
connection requests.
This model allows the server to scale well, as it can handle multiple
requests concurrently by spawning new threads to handle each request.
Thread creation is less expensive than process creation, and it is
possible to create a larger number of threads on a given system. An
example of this model is servlet request processing in a servlet engine,
also known as a “web container”.
In a threaded server, it is important to ensure that the functions
that are used to handle incoming requests are re-entrant, or able to be
safely called concurrently by multiple threads. This is because multiple
threads may be executing the same function simultaneously, and the
function must be able to handle this without causing problems.
To ensure that functions are re-entrant, it is important to avoid
using unprotected global variables, as these can be modified
concurrently by multiple threads, leading to potential race conditions
and other problems. Instead, state should be kept on the stack, with
each thread having its own copy of the state. This ensures that each
thread has its own private version of the state, and can operate
independently of other threads.
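A minimal thread-per-connection echo server in Java (the port and the
echo protocol are arbitrary choices for illustration); note that all
per-request state lives on the handling thread's stack:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Thread-per-connection server: the accept loop spawns one thread per
// client connection; all request state lives on that thread's stack,
// so the handler stays re-entrant without shared global variables.
public class ThreadedEchoServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(9090)) { // port arbitrary
            while (true) {
                Socket client = server.accept(); // blocks for next client
                new Thread(() -> handle(client)).start();
            }
        }
    }

    private static void handle(Socket client) {
        // every variable below is local to this thread: no races
        try (Socket c = client;
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(c.getInputStream()));
             PrintWriter out = new PrintWriter(c.getOutputStream(), true)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line); // echo the request back
            }
        } catch (IOException e) {
            // a failed connection only affects this one thread
        }
    }
}
```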
1.4.5.5 Design Considerations for
Socket-Based Services
Message formats: The message formats that will be
exchanged between clients and servers should be carefully designed to
ensure that data can be transmitted and received correctly. This
includes ensuring that data is represented consistently on different
hardware platforms and avoiding problems caused by different data
representations.
Protocol design: The protocol that will be used by
clients and servers to communicate should also be carefully designed.
This includes deciding whether clients will wait for answers from the
server (synchronous communication) or whether communication will be
asynchronous, whether the server can call back to the client, whether
the connection will be permanent or closed after each request, whether
the server will hold client-related state (e.g. a session), and whether
the server will allow concurrent requests.
1.4.5.6 Stateless vs. Stateful
Socket-Based Services
A stateless service is one in which the server does not store any
information about previous requests or maintain any state between
requests. This can have several advantages, including:
Scaling extremely well: Because the server does not
maintain any state between requests, it can handle a large number of
requests concurrently without running out of resources or becoming
overloaded.
Making denial of service attacks harder: Because
the server does not maintain any state between requests, it is more
resistant to denial of service attacks, as attackers cannot exhaust
resources by sending a large number of requests that require the server
to store state.
Forcing new authentication and authorization per
request: A stateless service can require clients to
authenticate and authorize each request separately, improving security
by requiring clients to prove their identity and permissions for each
request; this can, however, also be a disadvantage, as it adds
overhead to transfer the authentication credentials with every
request.
On the other hand, a stateful service is one in which the server
stores information about previous requests and maintains state between
requests. This can have several advantages, including:
Allowing transactions and delivery guarantees:
Because the server maintains state between requests, it is possible to
use transactions and guarantee delivery of requests.
Enabling more complex interactions: A stateful
service can support more complex interactions between clients and
servers, as it allows the server to track the state of the interaction
and respond appropriately.
However, stateful services can also have some drawbacks,
including:
Risk of resource exhaustion: Because the server
maintains state between requests, it is possible for the server to run
out of resources, such as sockets, if it receives many requests that
require it to store state.
Need for reliable hardware and networks: In order
to function correctly, stateful services typically require reliable
hardware and networks, as they rely on the server being able to maintain
state between requests. If the server or network experiences failures,
this can disrupt the stateful interaction.
1.4.5.7 TCP SYN Flooding
TCP SYN flooding is a type of denial of service (DoS) attack that
exploits a weakness in the TCP connection establishment process. In a
normal TCP connection, the client sends a SYN (synchronize) packet to
the server to initiate the connection, and the server responds with a
SYN-ACK (synchronize-acknowledge) packet to confirm that the connection
can be established. The client then sends an ACK
(acknowledge) packet to complete the three-way handshake and establish
the connection.
In a TCP SYN flooding attack, the attacker sends a large number of
SYN packets to the server, either from a single machine or from a
network of compromised machines. The server responds to each SYN packet
with a SYN-ACK packet, but the client never completes the
three-way handshake by sending an ACK packet. As a result, the
server is left waiting for the ACK packet, and the resources used to
track the half-open connections can be exhausted, preventing the server
from accepting any new connections.
TCP SYN flooding attacks can be mitigated by implementing SYN
cookies, which cryptographically encode the connection state in the
server's initial sequence number, so that half-open connections
consume no server memory. Other
measures, such as rate limiting and filtering of incoming SYN packets,
can also help to prevent or mitigate the effects of TCP SYN flooding
attacks.
1.4.5.8 Writing a Socket Client
& Server
For the server:
Define the port number of the service: The server
should specify the port number on which it will listen for incoming
connections. For example, the HTTP server typically listens on port
80.
Allocate a server socket: The server creates a
server socket and binds it to the specified port number. The server
socket then listens for new connections.
Accept an incoming connection: When a client
attempts to connect to the server, the server socket accepts the
connection and creates a new socket for the client connection.
Get the input channel: The server reads the input
channel from the socket to receive messages from the client.
Parse the client message: The server parses the
client message to understand the request being made.
Get the output channel: The server gets the output
channel from the socket, which is where it will send responses to the
client.
Do request processing: The server processes the
client request, either by handling it directly or by creating a new
thread to handle it.
Create a response message: The server creates a
response message, such as an HTTP response, to send back to the
client.
Write the message to the output channel: The server
writes the response message to the output channel, which sends it back
to the client.
Read new messages or close the connection: The
server can either read new messages from the client’s input channel or
close the connection if it is no longer needed.
For the client:
Define the hostname and port number of the server
host
Allocate a socket with the host and port
parameters
Obtain the input channel (for messages from the
server) and output channel (for messages to the server) from the
socket
Create a message to send to the server, such as
“GET /somefile.html HTTP/1.0”
Write the message to the output channel to send it
to the server
Read the response from the input channel and
display it
For a multithreaded client:
One thread reads from the console and writes to the output
channel
The other thread reads from the input channel and displays or writes
the server messages
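A sketch of the client steps above in Java, using the example request
from the text (the host name is a placeholder):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Socket client: connect to host and port, obtain the input and
// output channels, send an HTTP/1.0 request, and display the reply.
public class SimpleHttpClient {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket("example.com", 80); // host + port
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.print("GET /somefile.html HTTP/1.0\r\n\r\n"); // the message
            out.flush();
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // display the server's response
            }
        }
    }
}
```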
1.4.5.9 Distribution Transparency
with Sockets
Invocation: Server-side functions cannot be called
directly from the client side, and messages must be defined and sent
using socket operations
Location/relocation/migration: If the service
moves, the client connection will be broken
Replication/concurrency: Not supported
Failure: Not supported
Persistence: Not supported
1.4.5.10 Infrastructure of
Client-Server Systems
Directory: Helps locate the server
Proxy: Checks client authorization and routes
requests through the firewall
Firewall: Allows outgoing calls only
Reverse proxy: Caches results, ends SSL sessions,
and authenticates clients
Authentication server: Stores client data and
authorizes clients
Load balancer: Distributes requests across
servers
1.5 Remote Procedure Calls
1.5.1 Overview
Call versions: local, inter-process, and remote
Mechanics of remote calls:
Marshaling/serialization
Data representation
Message structure and schema evolution
Interface definition language
Tooling: generators
Cross-language call infrastructures: Thrift and gRPC
1.5.2 Types of RPCs
1.5.2.1 Call Versions
Local calls: made within a single process and do
not require network communication
Inter-process calls: Made between processes on the
same machine and can be implemented using a variety of mechanisms, such
as shared memory or pipes
Remote calls: Made between processes on different
machines and typically require network communication
1.5.2.2 Remote Calls vs. Remote
Messages
Remote calls: Hide the remote service calls behind
a programming language call and are tightly coupled with synchronous
processing
Remote messages: Use a message-based middleware and
create a new concept of a message and its delivery semantics; a message
system can simulate a call-based system, but not vice versa
1.5.2.3 In-Process (Local)
Calls
No special middleware is required for calls made
within the same programming language
Calls into the OS are not inter-process calls
Special attention is required for cross-language calls made within a
single process, such as calls to native code in Java
In-process calls are fast because they do not
require network communication or context switches between
processes.
Have “exactly once” semantics: a local call is
executed exactly once, since there is no network that could lose or
duplicate it.
Are type-safe and link-safe, meaning that they are
checked for type compatibility at compile time and will not result in
linker errors. However, they may have issues with dynamic loading if the
called code is not available at runtime.
Can be either sequential or concurrent, depending
on the design of the application. Concurrent in-process calls can be
implemented using threads or other concurrency mechanisms.
Operate within a single name and address space,
which means that they can access all variables and functions within the
process without requiring additional addressing or mapping.
Are independent of byte ordering, which means that
they can be used on any machine with any endianness without requiring
additional conversions.
Can be controlled in their memory usage, such as
through garbage collection, which can help prevent memory leaks and
improve the performance and stability of the application.
Can use value or reference parameters, depending on
the needs of the application. Value parameters are copied when passed to
a function, while reference parameters are passed by address and can be
modified within the called function.
Are transparent programming language calls and not
explicit messages, which means that they appear to the programmer as
regular function calls and do not require the creation and handling of
explicit messages.
1.5.2.4 Inter-Process Calls
(IPC)
Local inter-process calls are faster than remote calls, but
not as fast as in-process calls
Do not have “exactly once” semantics
Are type-safe and link-safe if both processes use
the same static libraries, but may have issues with dynamic loading
Can be either sequential or concurrent, but the
caller no longer controls the execution; the receiver must protect
itself from concurrent access
Cannot assume a single name and address space and
require additional addressing or mapping
Are still independent of byte ordering
Require cross-process garbage collection to manage
memory usage
Can only use value parameters, as the target
process cannot access memory in the calling process
Are not real programming language “calls” and
require the creation of messages to simulate the missing features
Inter-process communication is not local and introduces
latency, memory barriers, and the possibility of process
failures
The good news is that inter-process communication occurs on the same
hardware and uses the same language at the sender and receiver, which
can reduce security issues and ensure that a system
crash affects both sender and receiver (fail-stop semantics)
1.5.2.5 Remote Procedure Calls
(RPC)
Remote procedure calls (RPCs) are much slower than local
calls and do not have delivery guarantees
without a protocol
May have version mismatches that show up at
runtime
Are concurrent and the caller no longer controls
the execution; the callee must protect itself from concurrent
access
Cannot assume a single name and address space and
require additional addressing or mapping
Are affected by byte ordering
May require network garbage collection if they are
stateful
May involve cross-language calls
Can only use value parameters, as the target
process cannot access memory in the calling process
Are not real programming language “calls” and
require the creation of messages to simulate the missing features
Often stateless
1.5.3 Implementing Remote
Calls
1.5.3.1 Overview
Marshaling/Serialization: Process of converting
program data into a format that can be transmitted over a network
External Data-Representation: Standardized format
for representing binary data
Interface Definition: Defines a service, a set of
related functions that can be accessed remotely
Message Structure and Evolution: Way in which data
is organized and transmitted between systems
Compilers: Generate code (stub/skeleton or proxy)
to facilitate communication between systems
Request/Reply Protocol: Handles errors that may
occur during the remote call process
Process/I/O Layer: Responsible for managing threads
and input/output operations
1.5.3.2
Marshaling/Serialization
Marshaling/serialization is the process of converting parameters
(basic types or objects) into a common transfer format (message) that
can be transmitted over a network. At the target site, the message is
transformed back into the original types or objects. There are several
different approaches to marshaling/serialization, each with its own
trade-offs:
Language-dependent output format: This approach
uses a proprietary format that is specific to a particular programming
language. It may be slower and have limitations in expressiveness, but
it can be more efficient in terms of the size of the message.
Language-independent output format: This approach
uses a standardized format that is not tied to any particular
programming language. It may be more verbose and result in larger
messages, but it is more flexible and can be used with a wider range of
languages.
Binary schema-based: In this approach, the sender
and receiver both have a shared understanding of the structure of the
message, including the types and variables contained within it. Function
names are replaced with numbers and variable data is encoded to save
space. This approach is efficient in terms of message size, but it can
be inflexible.
Binary self-describing: In this approach, the
message contains information about its own structure, including the
types and variables contained within it. This allows for more
flexibility, but it requires the involved languages to have some level
of flexibility in their capabilities.
Textual, self-describing: This approach uses an XML
representation of the types or objects contained in the message. It is
flexible, but it can be verbose and result in larger messages.
Textual with schema for reader/writer: This
approach uses a schema to define the structure of the message, which
allows for advanced schema evolution and dynamic serializations. It can
be flexible, but it may result in larger messages.
Serialization to text is a method of converting data
into a human-readable format, typically using a markup language such as
XML. This approach can be less efficient in terms of the size of the
resulting message, as it is generally less compact than binary
serialization. It is important to be aware of language limits, such as
the range of integers and floating-point numbers that can be represented
in JavaScript, when using this approach. XML is a popular choice for
language-independent encoding because it is widely supported and can
represent a wide range of data types.
Serialization to binary is a method of converting
data into a compact, machine-readable format. It requires a schema,
which defines the structure of the message and the types of data it
contains. This approach is generally more efficient in terms of the size
of the resulting message, but it is not as flexible as text-based
serialization because it requires a shared understanding of the schema
between the sender and receiver. Binary serialization can also be used
for language-independent encoding.
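A tiny sketch of schema-based binary marshaling (the procedure number
and the field layout are invented for illustration). DataOutputStream
always writes big-endian values, which settles the byte-ordering
question between sender and receiver:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Schema-based binary marshaling: sender and receiver share the
// layout (procedure number, then arguments), so no field names go
// on the wire, keeping messages small.
public class Marshal {
    static final int PROC_TRANSFER = 3; // function name replaced by a number

    static byte[] marshalTransfer(int account, double amount)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(PROC_TRANSFER); // which procedure to call
        out.writeInt(account);       // first argument
        out.writeDouble(amount);     // second argument
        return bytes.toByteArray();
    }
}
```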
1.5.3.3 Schema Evolution
When data or functions change, it is important to consider how
different versions of a system will coexist and communicate with each
other.
Forward compatibility means that older receivers
should be able to understand messages from newer senders. This allows
newer versions of a system to be deployed without causing problems for
existing clients.
Backward compatibility means that newer receivers
should be able to understand messages from older senders. This allows
older versions of a system to continue functioning even after newer
versions have been introduced.
1.5.3.4 Keeping Compatibility when
Evolving a Schema
For JSON:
Forward compatibility means that older versions of a
system should be able to understand messages from newer versions.
Examples of changes that can be made to a system in a way that maintains
forward compatibility:
Adding a new required field: Older readers will
simply ignore the new field, so they will be compatible with newer
versions.
Narrowing a numerical type (e.g. float to int):
Older readers still expect floats; since every int is a valid float,
values written by newer senders remain readable.
Backward compatibility means that newer versions of
a system should be able to understand messages from older versions.
Examples of changes that can be made to a system in a way that maintains
backward compatibility:
Adding a field with a default value: Older writers
will be unaware of the new field, so they will use the default value
instead.
Adding an optional field: Older writers will be
unaware of the new field, so they will use the default value (usually
null) instead.
Widening a numerical type (e.g. int to float):
Older writers will always use ints, which are a subset of floats, so
they will be compatible with the newer float type.
Removing a field: Newer readers will ignore
whatever was previously written in the field that was removed, so they
will be compatible with older versions.
Full compatibility means that a change can be made
to a system without breaking compatibility with either older or newer
versions. Examples of changes that can be made to a system in a way that
maintains full compatibility include:
Adding a field with a default value
Adding an optional field
Incompatibility means that a change to a system will
break compatibility with either older or newer versions. Examples of
changes that can break compatibility include:
Renaming a field
Changing the type of a field (other than the
numeric conversions mentioned above)
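A sketch of these rules using the Jackson JSON library (an assumed
choice; any JSON binding with similar options behaves the same, and
the Order class is invented for illustration):

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;

// Forward compatibility: the old reader ignores fields it does not
// know, so a newer sender may add one. Backward compatibility: the
// new field has a default, so messages from older senders that lack
// it still deserialize.
public class OrderEvolution {
    @JsonIgnoreProperties(ignoreUnknown = true) // old readers skip new fields
    public static class Order {
        public int id;
        public String note = ""; // field added later, with a default value
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // message from a newer sender: the extra "priority" field is ignored
        Order a = mapper.readValue(
                "{\"id\":1,\"note\":\"hi\",\"priority\":5}", Order.class);
        // message from an older sender: the missing "note" keeps its default
        Order b = mapper.readValue("{\"id\":2}", Order.class);
        System.out.println(a.id + " " + b.note);
    }
}
```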
1.5.3.5 Stubs and Skeletons
Stubs and skeletons are code that is used to facilitate communication
between different systems, typically in the context of remote procedure
calls (RPCs). Stubs are used by clients to initiate a remote call, while
skeletons are used by servers to receive and process the remote
call.
There are several ways to generate stubs and skeletons,
including:
Generating them in advance from an interface definition
language (IDL) file: In this approach, the stubs and skeletons
are generated from an IDL file, which defines the interfaces and data
structures that are used for communication. The IDL file is used as a
blueprint for generating the code.
Generating them on demand from a class file: In
this approach, the stubs and skeletons are generated from a class file,
which defines the classes and methods that are used for communication.
The class file is used as a blueprint for generating the code.
Distributing them in advance to all clients and
servers: In this approach, the stubs and skeletons are generated in
advance and distributed to all the clients and servers that will be
using them. This allows the systems to be set up and configured in
advance, rather than on demand.
Downloading them on demand: In this approach,
the stubs and skeletons are downloaded by the clients and servers as
needed, rather than being distributed in advance. This allows for more
flexibility and can be more efficient in terms of bandwidth
usage.
1.5.3.6 Finding an RPC Server
In a remote procedure call (RPC) system, a client needs to be able to
locate the server in order to initiate a remote call.
Service:
Start listening at port X: The server starts
listening for incoming requests on a specific port.
Tell portmapper about program, version, and port:
The server registers itself with the portmapper, which is a service that
keeps track of the programs and ports that are available on the system.
The portmapper is typically a separate daemon that runs on the
system.
Client:
Ask portmapper for program, version: The client
sends a request to the portmapper asking for the port number of the
desired program and version.
The portmapper responds with the port number where
the desired program and version can be found.
Send procedure call to service: The client sends
the procedure call to the server on the specified port.
This process is known as “binding,” and there are several ways to
handle it, including using inetd, DCE, or a Unix portmapper.
1.5.3.7 Factors to Consider when
Choosing an IDL
Are data types easily expressed using the IDL? It
is important to choose an IDL that can easily represent the types of
data that will be used in the RPC system.
Is hard or soft versioning used? Hard versioning
means that the IDL is strictly enforced, and any changes to the IDL will
break compatibility. Soft versioning means that the IDL is more flexible
and can be changed without breaking compatibility.
Are structures self-describing? Self-describing
structures contain enough information to be understood by any system,
even if it is not familiar with the specific structure. This can be
useful for maintaining compatibility between different versions of a
system.
Is it possible to change the structures later and keep
backward compatibility? It is important to choose an IDL that
allows for changes to be made to the structures in a way that maintains
backward compatibility with older versions of the system.
Is it possible to change processing of structures later and
keep forward compatibility? It is important to choose an IDL
that allows for changes to be made to the processing of structures in a
way that maintains forward compatibility with newer versions of the
system.
Are there bindings for all languages in use at my
company? It is important to choose an IDL that has bindings for
all the languages that will be used in the RPC system.
Do I need different encodings (binary/textual)? It
is important to consider whether binary or textual encoding will be
needed for the RPC system, and choose an IDL that supports the desired
encoding.
Does changing serialization require a recompile? It
is important to consider whether changes to the serialization will
require a recompile, as this can affect the flexibility and ease of
maintenance of the system.
Can I extend/change the runtime system (e.g. add trace
statements)? It is important to choose an IDL that allows for
changes and extensions to the runtime system, such as the ability to add
trace statements.
1.5.3.8 Popular IDLs for RPCs
CORBA (Common Object Request Broker Architecture):
CORBA is a standard for interoperating between different programming
languages and platforms. It defines an interface definition language
(IDL) that can be used to describe the interfaces and data structures
that are used for communication, as well as a runtime system for
executing the RPCs.
Microsoft CLR (Common Language Runtime): The CLR is
a runtime environment for executing .NET programs. It includes a
cross-language call infrastructure that allows programs written in
different .NET languages to communicate with each other.
Thrift: Thrift is a cross-platform RPC framework
that allows different programming languages to communicate with each
other. It includes an IDL for defining the interfaces and data
structures that are used for communication, as well as code generators
for generating the stubs and skeletons that are used to initiate and
process the RPCs.
gRPC: gRPC is a high-performance RPC framework that
uses HTTP/2 as the underlying transport. It includes an IDL for defining
the interfaces and data structures that are used for communication, as
well as code generators for generating the stubs and skeletons that are
used to initiate and process the RPCs.
1.6 Distributed Objects
1.6.1 Overview
Fundamental Properties of Objects
Local and remote object references
Parameter passing
Object invocation types
Distributed Object Services
Object Request Broker Architectures
Interface Design
Java RMI
1.6.2 Basic Principles
1.6.2.1 Objects vs. Abstract Data
Types
Objects are data structures that combine state (data) and behavior
(methods or functions) in a single entity. They are a key concept in
object-oriented programming (OOP).
Abstract data types (ADTs) are data structures that are defined by
the operations that can be performed on them, rather than by their
implementation. ADTs are often used to model real-world concepts or
objects.
There are some differences between the two:
Objects have an identity, which is a unique
identifier that distinguishes the object from other objects in the
system. ADTs do not have an identity in the same way that objects
do.
Objects may store state (data) within themselves,
while ADTs do not store state. Instead, they define the operations that
can be performed on data stored elsewhere.
Objects may be implemented in different ways
depending on the programming language or system in which they are used.
ADTs, on the other hand, are defined more abstractly and can be
implemented in many ways.
1.6.2.2 Properties of Objects
Local objects are created with the new() operator and are
only accessible within the scope of their creator.
An object reference is a unique identifier that is
used to locate and access an object. It is returned when the object is
created and can be used to call the object’s methods.
Objects have a lifecycle that is tied to their
creator. They exist as long as the virtual machine (VM) is alive and the
objects are in use, and they are typically destroyed when their creator
is destroyed.
Objects have fine-granular interfaces and methods,
which means that they expose a large number of individual functions or
operations that can be performed on them.
Objects can be small and numerous, which means that
there may be many objects in a system, each with its own state and
behavior.
Objects in an object-oriented (OO) programming language have many
properties that make them challenging to implement in a concurrent
or distributed system.
1.6.2.3 Challenges for Remote
Objects
Remote objects (ROs) are objects that are accessed and managed
remotely, typically over a network. They are a key concept in
distributed object systems, which are systems that enable objects on
different machines to communicate and interact with each other.
Object identity is usually only valid locally,
meaning that it is only meaningful within the scope of the machine on
which the object is stored. This can make it difficult to identify and
access ROs from a remote machine.
ROs must be created and managed by a server, rather
than by a client. This is because the client does not have direct access
to the objects and cannot create them using the new() operator.
Clients must have a way to find and access ROs,
which may involve using a registry or other lookup service.
Concurrent access to ROs must be controlled to
prevent conflicts and ensure the consistency of the objects’ state. This
may involve using locks, mutexes, or other synchronization
mechanisms.
ROs may have a longer or shorter lifespan than local
objects, depending on how they are managed by the server.
If a server dies, the ROs it was managing may also
be destroyed, depending on how the system is configured. This could
potentially cause clients to lose access to their objects.
The state of an RO may be maintained by the server or by the
client, depending on the implementation of the system.
If a client dies, the ROs it was using may continue to exist
on the server, depending on how the system is configured.
The cost of using ROs may be higher than using
local objects due to the overhead of communication and synchronization
across the network.
ROs may have the same interface as local objects, or they may have a
different interface that is tailored to the needs of remote
clients.
1.6.2.4 What is a Remote
Object?
A remote object is an object that is accessed and managed remotely,
typically over a network.
It is a combination of a unique identity, an
interface, and an implementation.
The unique identity is used to locate and access the object from a
remote machine.
The interface defines the operations that can be performed on the
object.
The implementation is the code that defines how the object behaves
and carries out these operations.
Clients know the interface and use the identity of
the remote object, but do not know about the
implementation.
To clients, the interface of the remote object represents the entire
object.
Achieving complete transparency of remote calls behind object
interfaces can be challenging in practice, especially
in distributed systems.
1.6.3 Components of Distributed
Objects
1.6.3.1 Object Models and Type
Systems
CORBA:
Defines its own basic types, including sequence,
string, array, record, enumerated, and union.
Uses value objects (data) to represent data that is
passed between objects.
Uses remote object references (reference semantics)
to locate and access objects on remote machines.
CORBA is designed to provide language independence, so it defines its
own types and does not allow user-defined classes to be
used in the interfaces of remote objects.
Java RMI:
Uses the basic types of the Java language, such as
int, byte, etc.
Allows serializable non-remote objects (value
semantics) to be used as value objects.
Uses remote object references (reference semantics)
to locate and access objects on remote machines.
Java RMI allows user-defined classes to be used as
value objects if they are serializable, but is specific to the Java
programming language.
1.6.3.2 Accessing Remote
Objects
A naming service is a service that acts like a
directory and allows clients to look up the location of a remote object
based on its name or other identifier. The client can then use the
location information to access the object.
A web server can be used to host serialized
versions of remote objects. The client can request the object from the
server over the web and deserialize it to access its methods and
state.
Remote objects can be accessed through other means, such as via mail
or a piece of paper. For example, a client might send a request for a
remote object to another machine via mail or other physical means, and
the server could return the object or a reference to the object in the
same way.
Another remote object can serve as a “factory” that
creates and manages other remote objects. The client can access the
factory object and use its methods to create and access other remote
objects as needed.
1.6.3.3 The Broker Pattern
Is a design pattern that involves using a separate component (the
broker) to mediate communication between two other components (the
client and the service).
Serves as an intermediary between the client and
the service and is responsible for routing requests and responses
between them.
Decouples the client and the service, allowing them
to evolve independently and communicate through the broker.
1.6.3.4 Remote Object
Reference
Object implementation (servant): The object
implementation is the code that defines how the object behaves and
carries out its operations. This is also known as the object’s
“servant,” as it serves the object and carries out its requests.
Object adapter: The object adapter is a component
that manages the communication between the object implementation and
the object request broker (ORB). It is responsible for routing
requests from the ORB to the object implementation and returning
responses to the ORB.
Active object map (remote object table): The active
object map is a data structure that maps object identifiers to object
implementations. It is used to locate the object implementation for a
given object identifier.
Object request broker (ORB): The ORB is the
component that handles requests to and from remote objects. It is
responsible for marshalling and demarshalling requests and responses,
as well as routing requests to the appropriate object adapter.
1.6.3.5 Static and Dynamic Invocation
Static Invocation:
Involves using pre-generated stubs that are linked
to the client program in advance.
The client program uses the stubs to invoke the remote method in the
same way that it would invoke a local method.
The stubs handle the details of marshalling and
demarshalling the request and response, as well as routing the
request to the appropriate object on the remote machine.
Dynamic Invocation:
Involves building a request object at runtime,
based on the meta-information of the remote object.
The request object is sent to a dispatcher on the servant host,
which handles the request as if it were a normal method invocation.
The dynamic invocation approach is similar to the reflection
pattern, which involves using meta-information about an object
to manipulate it at runtime.
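In Java, the reflection mechanism gives a rough feel for dynamic
invocation (the dispatcher shown is a toy, not a real ORB):

```java
import java.lang.reflect.Method;

// Dynamic invocation via reflection: the request carries the method
// name and arguments; the dispatcher looks the method up at runtime
// using the object's meta-information, then invokes it like a normal
// method call.
public class Dispatcher {
    public static Object dispatch(Object servant, String methodName,
                                  Object[] args, Class<?>[] types)
            throws Exception {
        Method method = servant.getClass().getMethod(methodName, types);
        return method.invoke(servant, args);
    }

    public static void main(String[] args) throws Exception {
        // "request object" built at runtime: method name plus argument
        Object result = dispatch("hello", "substring",
                new Object[]{1}, new Class<?>[]{int.class});
        System.out.println(result); // prints "ello"
    }
}
```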
1.6.3.6 Asynchronous
Invocations
One-way calls: One-way calls are calls that do not
expect a response from the server. They are called “one-way” because
they involve only a single message being sent from the client to the
server. One-way calls cannot have return values or out parameters, and
their delivery is best-effort (meaning that there is no guarantee that
the message will be delivered).
Deferred synchronous: Deferred synchronous
invocations involve making a call to a remote method and continuing to
execute while the method is being executed. The client can later check
for the results of the method (blocking), but the delivery is
at-most-once (meaning that the message will be delivered at most one
time).
True asynchronous with server callbacks: True
asynchronous invocations involve making a call to a remote method and
continuing to execute while the method is being executed. The server can
differentiate between synchronous and asynchronous calls and can send a
callback message to the client when the results of the method are
available. True asynchronous invocations require messaging middleware to
achieve at-most-once delivery guarantees.
1.6.4 Implementing Remote
Objects
1.6.4.1 Main Distributed Object
Services
Finding objects:
Naming service: Maps names to object
references
Trading service: Allows objects to offer their
services and allows clients to search for objects by constraint
Preserving object state:
Persistence service: Stores object state
transparently and allows it to be loaded on demand
Transaction service: Preserves object consistency
across changes to multiple objects in a distributed, nested, or flat
context
Concurrency service: Provides locks for shared
objects
Security service: Checks the roles of
principals
Grouping of objects: Collections
1.6.4.2 Portable Interceptors
Interceptors can be used to transparently add additional
(context) information to calls and transport it between object
request brokers (ORBs).
1.6.4.3 Remote Interface
Design
Respect the possibility of concurrent calls in your
interface design: Avoid keeping inconsistent state across method
calls.
Avoid the “half-baked object” anti-pattern: Do not
perform staged initialization of an object.
Avoid using complicated or unclear orders of calls:
Design the interface in a clear and straightforward manner.
1.6.4.4 Problems with Remote
Objects
Interfaces: Remote object interfaces can be too
granular, with many small methods that perform simple tasks. This can
lead to slow performance, as each call to a remote object involves a
significant amount of overhead.
No direct support for state handling on servers:
Distributed object systems do not generally provide direct support for
managing the state of remote objects on the server side. This can make
it difficult to maintain consistent state across multiple remote objects
or to persist the state of an object across different contexts.
Bad for “data schlepping” applications: Remote
objects are often too expensive to use in applications that involve
“data schlepping,” or the transfer of large amounts of data between the
client and server. This is because each call to a remote object involves
significant overhead, which can make it inefficient to transfer large
amounts of data.
Cross-language calls expensive to build: Making
calls between objects implemented in different programming languages can
be expensive, as it often requires the use of complex marshalling and
unmarshalling mechanisms to convert data between the different
languages.
No security in calls: Distributed object systems do
not generally provide built-in security for remote method calls, which
can make it difficult to enforce security policies or protect sensitive
data.
No transaction support: Distributed object systems
do not generally provide built-in support for transactions, which can
make it difficult to ensure the consistency of data across multiple
objects in a distributed system
1.6.5 Java RMI
1.6.5.1 Java RMI Request/Reply
Protocols
JRMP (Java Remote Method Protocol) is the first
protocol for RMI. It has the following characteristics:
Bandwidth problems due to distributed garbage
collection with short-term leases and permanent reference counting.
Allows for the dynamic download of code.
RMI-IIOP (RMI over CORBA’s Internet Inter-Orb
Protocol) has the following characteristics:
Uses the Java Naming and Directory Interface (JNDI) to
look up object references and is persistent.
Requires code changes and the use of PortableRemoteObject.
Requires the generation/definition of code and IDL
files for CORBA systems.
Allows for the movement of IDL files for Java Remote Object
Interfaces to CORBA systems, and the generation of CORBA stubs with IDL
compilers. This allows CORBA systems to call Java remote
objects, or for Java systems to call into CORBA systems.
1.6.5.2 Java RMI Classes &
Tools
Remote: Tag interface that Remote Object Interfaces
extend.
RemoteException: Class that is thrown by all Remote
Object methods.
Naming: Class that is used by clients to find
remote object references, and by servers to register their objects.
UnicastRemoteObject: Class that Remote Object
Implementations extend.
rmic: Tool that generates stub/skeleton/IDL
files.
rmiregistry: Simple name server for Java objects.
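A minimal server-side sketch tying these pieces together (the Greeter
interface and URL are illustrative; a registry must already be running,
e.g. started with rmiregistry):

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

// Remote Object Interface: extends the Remote tag interface; every
// method must declare RemoteException.
interface Greeter extends Remote {
    String greet(String name) throws RemoteException;
}

// Remote Object Implementation: extends UnicastRemoteObject.
class GreeterImpl extends UnicastRemoteObject implements Greeter {
    GreeterImpl() throws RemoteException { super(); }

    public String greet(String name) throws RemoteException {
        return "Hello, " + name;
    }
}

public class Server {
    public static void main(String[] args) throws Exception {
        // Register the servant with the name server on localhost:1099.
        Naming.rebind("rmi://localhost/greeter", new GreeterImpl());
        System.out.println("greeter bound");
    }
}
```

A client would then obtain a stub via
Naming.lookup("rmi://localhost/greeter"), cast it to Greeter, and call
greet as if it were a local object.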
1.6.5.3 Activation in Java RMI
Activation is an important feature in Java RMI (Remote Method
Invocation) because it allows servers to transparently store servant
state on persistent storage and recreate servants on demand. This helps
the server manage its resources and guard against memory
exhaustion and performance degradation.
1.6.5.4 Security in Java RMI
Specify the quality of service (QoS) of sockets used by RMI, such as
using an SSL channel.
Use an RMISecurityManager to prevent or control
local access from downloaded implementations.
1.7 Distributed Business
Components
1.7.1 Overview
Part One: General Component Technology covers the
following topics:
The limitations of traditional object-oriented programming.
Distributed components.
Mapping business concepts to programming constructs.
Part Two: Enterprise Java Beans Example covers the
following topics:
Object model of Enterprise Java Beans (EJBs).
Basic mechanisms of EJBs.
Separation of concerns in EJBs, including persistence, transactions,
and security.
Separation of context in EJBs, including the environment.
Evolution and lessons learned from EJBs.
1.7.2 Basic Principles
1.7.2.1 Problems With Object
Interfaces
Object interfaces are tightly intertwined networks
of references that share state and promises, and changes to these
interfaces can have ripple effects.
A component framework can simplify the interface
for calling objects by encapsulating them.
A messaging system can be stateless, or it can
include all necessary state in the message itself, which is known as
context-complete communication. This approach is used
in webservices and REST architectures and is known as Service Oriented
Architecture (SOA).
1.7.2.2 Component Based
Processing
Components are self-contained software packages
with runtime interfaces and automatic deployment capabilities that are
designed to fit into a component framework.
A component framework allows components to be
plugged in and supports the composition and
collaboration between them.
There are roles for development, composition, and
installation of components in the component model.
Enterprise components build on this:
Integration with existing infrastructure, such as
transactions, security, and legacy systems
Network addressable interfaces
Medium to large granularity, potentially
representing 10-20 tables or more
Representation of a business concept in an
isomorphic manner
1.7.2.3 From Objects to
Components
Object-oriented design involves isolated,
monolithic applications that are not distributed.
Distributed objects involve calls between
applications, but can have management and performance issues with large
numbers of small remote objects.
Distributed systems involve multi-tier systems with
point-to-point connectivity, but can be expensive and difficult to
develop.
Distributed components use a framework for
pluggable business components and create a market for interoperable
components. This approach addresses modeling, development, and
deployment and aims to achieve re-use and lower development costs.
Components go beyond distributed systems and were
designed to address some of the same issues as object-oriented
development.
1.7.2.4 Business Components
Business concepts, such as entities and processes, can be translated
into software artifacts like EC1 and PC1.
Business components are designed to
directly represent concepts from the business and may
be represented using the UML “package” construct.
1.7.2.5 Alternatives to Isomorphic
Mapping
Isomorphic mapping, or the idea that software artifacts directly
correspond to business concepts, may not always be the best
approach.
Alternatives to isomorphic mapping include domain analysis,
generative computing, and aspect-oriented development.
In generative computing, different models are used to capture the
transformation process, including computation-independent models,
platform-independent models, and platform-dependent models.
1.7.2.6 Generative Computing
Involves the use of domain-specific languages and code
generators to automate software development
Reduces the need for manual coding and allows developers to
focus on design and functionality
Uses a high-level, abstract description of the
desired software, which is then converted into actual code by a code
generator
Can save time and reduce the risk of errors by
eliminating the need to manually write and debug code
Often used in domains with repetitive or boilerplate
code or where the complexity of the code makes manual coding
impractical
Can be used to create customizable software for
different users or environments
1.7.2.7 Monolithic Software
vs. Components
Components are clusters of software, configuration, and
other elements that form a unit for deployment and maintenance
within a component framework.
Components are made up of collaborating elements and can be
adjusted without requiring source code changes. This is
an alternative to traditional applications, which are monolithic and
cannot be easily modified.
1.7.2.8 Objects vs. Components
A regular object combines business logic with
specific mechanisms such as persistence and hides the internal
interfaces behind the external interface.
Objects may make assumptions about the environment,
such as which database to use, and these are hidden in the code.
Customizing objects requires code changes, which
can be time-consuming and risky.
1.7.2.9 Separation of Concerns and
Context
The internal interface of a component is described in
meta-information, which allows deployers to connect the
component to the appropriate framework services.
Concerns such as persistence and transactions are separated from the
business logic of the component and are typically implemented by
the framework.
Context information, such as which database to use,
is contained in meta-information rather than in the
code of the component.
This separation of concerns and context allows components to
be customized after development.
1.7.3 Enterprise Java Beans
1.7.3.1 Overview
Enterprise Java Beans (EJBs) allow the construction of distributed
applications by combining components from different vendors.
Developers do not need to understand low-level
distributed mechanisms such as transactions when using EJBs.
Can run in EJB containers from different vendors
without modification.
Provide enterprise lifecycle support, including
development, deployment, and runtime support.
Provide enterprise data support.
1.7.3.2 EJB Component Model
The EJB component model includes four types of EJBs: stateless
session beans, stateful session beans, entity beans, and stateless
message-driven beans.
Stateless session beans provide stateless services
that are shared among clients.
Stateful session beans hold conversational state
and are not shared among clients.
Entity beans map to rows in a database and are
shared among clients. They are used to store and access business data in
a persistent manner.
Stateless message-driven beans receive messages
asynchronously and can connect to an enterprise MOM.
These different EJB types allow for scalability, client code on the
server side, asynchronous processing, and the representation of business
data. Entity beans are permanent, while stateful session beans do not
survive a server crash.
1.7.3.3 Session Beans (Stateful and
Stateless)
Session beans are per-client objects; stateful
session beans in particular represent client code on the server
side.
Both stateful and stateless session beans can participate in
transactions if the SessionSynchronization interface is
used.
Session beans may access databases, but do not
directly represent persistent objects.
Session beans are short-lived and are removed when
the container crashes.
Stateless session beans are the most scalable and EJB servers should
be able to support large numbers of them.
1.7.3.4 Entity Beans (Deprecated
Since 3.0)
Entity beans are shared objects that are protected
through transactions.
They have a longer lifetime than session beans and
represent important business data.
Entity beans can be persisted through the container
mechanism (Container Managed Persistence, or CMP) or they can handle
their own persistence (Bean Managed Persistence, or BMP).
Entity beans have a unique identity, which is
visible to clients through a primary key, and clients can request a
“handle” which is a persistent pointer to the entity object.
In EJB 3.0, persistence is no longer a concern for EJBs and is
instead covered by the Java Persistence API.
1.7.3.5 Message-Driven Beans
Message-driven beans are invoked asynchronously and
do not have client context available during processing.
They can be transaction aware, meaning that message receipt and
processing can be enclosed in a single transaction.
Message-driven beans are short-lived and stateless,
and are removed when the container crashes.
They do not directly map to business data.
Message-driven beans have been integrated into the general EJB
framework to reuse EJB container services such as
transactions, security, concurrency, and deployment description.
If a message-driven bean crashes during processing, the message is
marked as “unread” and can be processed after the container is
restarted.
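A minimal message-driven bean in EJB 3 style (the OrderListener name is
illustrative; the destination queue/topic would be bound via activation
configuration or the deployment descriptor):

```java
import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;

@MessageDriven
public class OrderListener implements MessageListener {
    @Override
    public void onMessage(Message msg) {
        // Invoked asynchronously by the container; no client context is
        // available here. With container-managed transactions, receipt
        // and processing run in one transaction, so a crash leaves the
        // message "unread" for redelivery after restart.
    }
}
```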
1.7.3.6 Client View of EJBs
Clients never access the bean class directly.
EJBs can offer a remote or local interface to
clients (clients of the local interface must be in the same Java virtual
machine as the bean container).
The business logic is contained in the bean class.
1.7.3.7 Local vs. Remote
Interfaces
The remote interface of an EJB uses remote object calling
conventions, while the local interface uses local Java calling
conventions.
When calling the local interface of an EJB, value
objects are passed by reference, meaning the client and
EJB will share the same objects.
The remote interface, on the other hand, requires
that value objects be copied.
EJB implementers who want to provide both a local and a remote
interface must be aware of these different calling conventions and
design their beans accordingly.
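A sketch of the two views in EJB 3 style (Cart and Item are
hypothetical):

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import javax.ejb.Local;
import javax.ejb.Remote;
import javax.ejb.Stateful;

class Item implements Serializable { String name; }

@Local
interface CartLocal { void add(Item item); }  // arguments shared by reference

@Remote
interface CartRemote { void add(Item item); } // arguments copied by value

// One bean class can expose both views, but its implementer must keep
// the different calling conventions in mind.
@Stateful
public class CartBean implements CartLocal, CartRemote {
    private final List<Item> items = new ArrayList<>();
    public void add(Item item) { items.add(item); }
}
```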
1.7.3.8 Separation of Concerns and
Context in EJB
Separation of concerns is done through the EJB framework, separation
of context is done through deployment:
The automatic transaction management feature maps
to system management transaction modes.
Persistence maps to system management definitions
of data sources and pool sizes.
Automatic, method-level security maps to system
management definitions of role/user binding.
The roles involved in component development, application assembly,
and deployment map to the deployment descriptor and JNDI interface.
1.7.3.9 The EJB Container
A client invokes an entity bean interface.
The container delegates the request to the entity bean
business logic.
The entity bean business logic delegates tasks such
as transactions, persistence, and security to resources accessed through
JNDI and a database.
At the point of interception, the container provides resource
management, lifecycle, state management, transactions, and security
services to the bean.
Containers are increasingly taking over roles that were
previously the responsibility of operating systems, such as
isolating different applications from each other.
In the J2EE environment, class loaders are used for this
purpose.
However, it is believed that a better concept is
needed that does not mix loading with isolation.
1.7.3.10 Containers and
Threads
Containers manage resources across applications and store context
and session information in thread-local storage.
As a result, container-managed applications are not allowed
to create their own threads, as these threads would not have
the necessary metadata and context information.
EJB 3.0 offers a managed service for connectors to create threads,
allowing applications to use threads without having to manage resources
themselves.
Applications should not assume responsibility for resource
management, as the container is responsible for choosing the
appropriate resource management policy.
1.7.3.11 Entity Bean Container
Contract
The Entity Bean-Container Contract defines a set of methods that the
container can call on an entity bean in order to manage its lifecycle
and access resources.
These methods include:
setEntityContext: Bean stores context as an interface
to the environment
PrimaryKeyClass ejbCreate: Actions related to bean
instance construction
ejbPostCreate: Bean identity is now available
ejbActivate: Bean can acquire necessary resources
ejbPassivate: Bean releases resources, expecting to be
put back into the pool
ejbRemove: Last chance for the bean before
destruction
ejbStore: Bean should update its internal state,
expecting it to be synchronized with the database right after this
ejbLoad: Bean should update its internal state,
expecting that its virtual fields have just been read from the
database
ejbFind, ejbSelect: Query methods
generated at deployment
ejbHome<method>: Business logic that does not
require an object identity
The actions that can be performed in these framework methods depend
on the availability of a transactional context, object identity, local
or remote view, and client security context.
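A BMP entity bean skeleton showing where these callbacks land (EJB 2.x
style; AccountBean and its Integer primary key are illustrative, and
the bodies are stubs):

```java
import javax.ejb.EntityBean;
import javax.ejb.EntityContext;
import javax.ejb.RemoveException;

public class AccountBean implements EntityBean {
    private EntityContext ctx;

    public void setEntityContext(EntityContext ctx) { this.ctx = ctx; } // store environment
    public void unsetEntityContext() { this.ctx = null; }

    public Integer ejbCreate(Integer id) { return id; }  // construct; return primary key
    public void ejbPostCreate(Integer id) { }            // identity now available

    public void ejbActivate() { }   // acquire necessary resources
    public void ejbPassivate() { }  // release resources, back to the pool
    public void ejbLoad() { }       // virtual fields were just read from the DB
    public void ejbStore() { }      // update state; DB synchronization follows
    public void ejbRemove() throws RemoveException { } // last chance before destruction
}
```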
1.7.3.12 Bean Managed vs. Container
Managed Persistence
There are two main approaches to persistence in EJBs: bean-managed
and container-managed.
In bean-managed persistence, the bean is
responsible for performing its own persistence, with the timing of these
actions controlled by the container.
In container-managed persistence, the bean’s state
is completely stored and loaded by the container.
Container-managed persistence is generally seen as the
preferred approach going forward, as it is more portable
and does not require adjustments for different datastores. Bean-managed
persistence, on the other hand, is less portable and requires more
effort to adapt to different datastores.
1.7.3.13 JNDI Naming Context
EJBs locate all their resources through JNDI (Java Naming and
Directory Interface) calls, which allows deployers to specify
the proper services for each EJB through the deployment
descriptor.
This allows the deployer to manipulate all JNDI
lookups and customize the resources that each EJB has access
to.
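For example (a minimal sketch; the logical name jdbc/AccountDB is
hypothetical and is mapped to a concrete data source by the deployer):

```java
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

public class AccountDao {
    DataSource lookupDataSource() throws NamingException {
        // The bean only names the resource; the deployment descriptor
        // decides which actual database this resolves to.
        InitialContext ctx = new InitialContext();
        return (DataSource) ctx.lookup("java:comp/env/jdbc/AccountDB");
    }
}
```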
1.7.3.14 Deployment Descriptor
The deployment descriptor is a file that contains
meta-information about an EJB in XML format.
This meta-information is used by different
components, including the bean itself (to declare the names and
interfaces used), the deployer (to adjust values for specific
environments), and generators (to generate queries from the
meta-information).
1.7.3.15 Security Support
Principal delegation:
In this mode, the EJB container passes along the
identity of the original caller (client) to the EJB, allowing
the EJB to operate under the same identity as the client.
This allows the EJB to access resources and perform actions
based on the permissions of the client.
“Run as” identity:
In this mode, the EJB container assigns a specific identity
to the EJB, regardless of the identity of the original
caller.
This allows the EJB to operate under a specific identity that is
different from the client’s identity, and allows the EJB to access
resources and perform actions based on the permissions of the assigned
identity.
This can be useful in situations where the EJB needs to
perform actions on behalf of a specific user or group,
regardless of the identity of the client.
1.7.3.16 Transaction Modes
EJBs support several transaction modes, which allow developers to
specify the level of transaction support required by the EJB.
The available transaction modes are:
Not supported: The EJB does not require or support
transactions.
Required: The EJB requires that a transaction be
active when it is called, and will participate in the existing
transaction if one is already active. If no transaction is active, a new
one will be started.
Supports: The EJB will participate in an existing
transaction if one is active, but it does not require a transaction to
be active when it is called. If no transaction is active, the EJB will
execute without a transaction.
RequiresNew: The EJB requires a new transaction to
be started when it is called, regardless of whether a transaction is
already active.
Mandatory: The EJB requires that a transaction be
active when it is called, and will throw an exception if no transaction
is active.
Never: The EJB does not support transactions and
will throw an exception if a transaction is active when it is
called.
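In EJB 3 style these modes are expressed with the TransactionAttribute
annotation (bean and method names are illustrative):

```java
import javax.ejb.Stateless;
import javax.ejb.TransactionAttribute;
import static javax.ejb.TransactionAttributeType.*;

@Stateless
public class OrderService {
    @TransactionAttribute(REQUIRED)     // join the caller's tx or start a new one
    public void placeOrder(String id) { }

    @TransactionAttribute(REQUIRES_NEW) // always start a fresh tx
    public void audit(String entry) { }

    @TransactionAttribute(MANDATORY)    // throw if the caller has no active tx
    public void debit(String account, long amount) { }

    @TransactionAttribute(NEVER)        // throw if a tx is active
    public void report() { }
}
```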
1.7.3.17 Best Practices for
EJBs
Use local interfaces if possible to reduce network
overhead
Use transfer objects to reduce network traffic
Avoid excessive use of entity beans and use local session
beans instead
Use message driven beans for asynchronous processing
Use container managed transactions and security to
simplify development
Use container managed persistence and avoid bean
managed persistence if possible
Use the Java Persistence API (JPA) for persistence
management instead of entity beans in EJB 3.0 and later
Properly design your data model and access patterns to optimize
performance
Use performance monitoring tools to identify and
fix bottlenecks in your application
Properly size and configure your EJB server to meet
the needs of your application.
1.7.3.18 Shortcomings of EJBs
Large number of artifacts for the programmer to
control
Meta-data separated in deployment descriptor
instead of code
Home interfaces and finding of remote objects
tedious
Performance problems in the O/R mapping due to the
abstract schema approach
No rapid prototyping possible
Entity beans overloaded with security, transactions and
persistence
1.7.3.19 New Ways for
Object-Relational Mapping
The trend towards using generic, abstract schema mappings for O/R
mapping appears to be over, and SQL is becoming more
prevalent.
Developers have historically had difficulties with the highly
abstract entity beans used in Enterprise Java Beans (EJBs) for
persistence. The use of plain old Java objects (POJO)
for persistence has been seen as a more straightforward and effective
solution.
1.7.3.20 Lessons Learned from
EJBs
It takes a long time to develop and scale a complex
framework for components like EJBs.
It is important to get the interfaces right in order to achieve good
performance.
Developers may not fully understand the implications of
abstractions.
Frameworks can sometimes require developers to do
unnecessary code duplication.
It is important to avoid coupling too many concerns
(such as transactions, persistence, and security) in a single
framework.
Code generation can be helpful, but it requires
tooling and customization.
It is best to wait for “best practice patterns” to
emerge before implementing complex new technology on a large scale.
1.8 Services
1.8.1 Overview
A recap of CORBA
Web services
SOA
Microservices
Conway’s Law
Serverless computing
1.8.2 Timeline of Distributed
Service Architectures
In the early 1990s, distributed service architectures for use in
intranets (private networks) emerged, including Unix RPC and
DCE.
In 1998, CORBA (Common Object Request Broker
Architecture) was introduced as a distributed service architecture for
use in intranets.
In the early 2000s, the REST (Representational
State Transfer) architectural style was introduced for building web
services that could be accessed over the internet.
In 2004, SOA (Service-Oriented Architecture)
emerged as a way to design and build web services that could be easily
reused and composed to create larger, more complex systems.
In 2010, the Microservices design pattern became
popular as a way to build large, complex applications by composing
small, independent services that communicate with each other through
APIs.
In 2016, the concept of Serverless Computing
emerged as a way to build and run applications and services without the
need to manage the underlying infrastructure.
1.8.3 CORBA
1.8.3.1 CORBA Security Model
In the CORBA security model, the concept of secure
delegation is used to ensure that communication between systems
is secure.
When systems communicate with each other, they
authenticate themselves to ensure that they are
authorized to exchange information.
Tokens are used to flow client information between systems, but
no secrets are shared.
Defined routes are used to prevent token abuse and
ensure that later tiers can verify the original requestor and route of
the request.
1.8.3.2 CORBA Core Properties
CORBA was primarily designed as an intranet
technology for use within private networks.
CORBA is language-independent, with a focus on
defining interfaces between systems.
The IIOP (Internet Inter-ORB Protocol) is the base
protocol used in CORBA to ensure interoperability and handle
cross-cutting concerns.
IIOP also provides delivery guarantees for
communication between systems.
CORBA was primarily used to connect heterogeneous (legacy)
software in large corporations.
The standardization process for CORBA was
difficult and tedious.
The use of “boilerplate code” in CORBA often led to
extensive code generation and model-driven
development.
1.8.4 Web Services
1.8.4.1 Overview
A Web Service is a software component that represents a
business function or service, and can be accessed by other
applications over public networks using standard
protocols and transports (such as SOAP over HTTP).
Web Services use XML to create requests and
responses and send them using HTTP, allowing machines to communicate
with each other for various purposes, such as supply chain management or
business-to-business processing.
XML-RPC, proposed by David Winer, was one of the
earliest standards for Web Services.
Web Services have been used internally by companies
for some time.
1.8.4.2 Web Services Core
Properties
Web Services use “simple” requests that can be sent
over public networks/the internet using HTTP transport for firewall
reasons.
XML is used as the message format for Web Services, making them
language-independent.
Features such as reliability, security, and transactions were added
to Web Services to improve their functionality.
Many Web Services were re-writes of CORBA interfaces using
XML syntax and expressing a business function.
Web Services were heavily promoted as a solution for automatic
interoperability based on self-describing services and ontologies, but
this was largely overhyped.
The technical foundation for Web Services was provided by forms of
XML-RPC, although the acronym “SOAP” (Simple Object
Access Protocol) did not actually have anything to do with distributed
objects.
1.8.4.3 UDDI Functionality
UDDI (Universal Description, Discovery, and Integration) is a
registry that provides a “find and publish” API for distributed
services. It works like this:
Providers publish their services in a
registry.
Requesters search for the desired service in the
UDDI (Universal Description, Discovery, and Integration) registry.
The UDDI registry retrieves the provider location and
WSDL (Web Service Description Language) service description for
the requester.
The requester creates a request based on the WSDL
description.
The requester sends the request to the provider
using a specified transport protocol (such as SOAP over HTTP).
This type of architecture is called
“service-oriented” because it uses a broker for service
advertisement and lookup, and requester and provider bind dynamically
with respect to the transport protocol used.
1.8.4.4 UDDI Content and
Categories
All content in UDDI is expressed in XML.
The UDDI registry includes information about companies and services,
as well as meta-information elements such as tModel.
A key feature of UDDI is the expectation that requester and provider
will do a dynamic bind, agreeing on service and
transport characteristics.
The UDDI registry has three main categories of information:
White pages: information about companies, such as
location and contact details.
Yellow pages: business categorization and
classification by type and industry.
Green pages: meta information about services and
their qualities.
1.8.4.5 WSDL Overview
WSDL (Web Service Description Language) is the metadata
language used by Web Services.
WSDL defines how service providers and requesters understand Web
Services.
When exposing back-end systems as Web Services, WSDL defines
and exposes the components and lists all the data types, operations, and
parameters used by the service.
WSDL provides all the information that a client application
needs to construct a valid SOAP invocation, which is then
mapped onto back-end enterprise logic by the Web Services platform.
1.8.4.6 WSDL Elements
WSDL documents define services as collections of network
endpoints or ports.
The abstract definitions of endpoints and messages in WSDL are
separated from their concrete network deployment or
data format bindings.
Separation of abstract and concrete definitions allows for
the reuse of abstract definitions.
A WSDL document includes the following elements:
Types: A container for data type definitions using
a specified type system (such as XSD).
Message: An abstract, typed definition of the data
being communicated.
Operation: An abstract description of an action
supported by the service.
Port Type: An abstract set of operations supported
by one or more endpoints.
Binding: A concrete protocol and data format
specification for a particular port type.
Port: A single endpoint defined as a combination of
a binding and a network address.
Service: A collection of related endpoints.
1.8.4.7 SOAP
SOAP is an RPC (Remote Procedure Call) protocol that uses
XML. It includes elements for type marshalling and RPC
semantics.
The header element in SOAP can contain
meta-information, but it is optional.
There are several aspects that define it’s performance:
Marshaling time
Internet transport time
Effect of size on transport
Demarshaling time
It has been found that internet transport time,
especially in the absence of Quality of Service (QoS) measures,
has a greater impact on overall request time than the
size and interpretation effort of a textual format.
1.8.4.8 Security and Web
Services
SOAP, WSDL, and UDDI: Message Envelope, Interfaces
Definition, and Registry.
WS-Security: Secure Messaging Definitions.
WS-Trust: How to Get Security Tokens (issuing,
validation, etc.).
WS-Federation: How to Make Security Interoperable
Between Trust Domains.
WS-Policy: How to Express Security
Requirements.
SAML: A Language to Express Security-Related
Statements.
WS-Reli: Rights Management.
WS-Util: Helper Elements.
WS-Authorization: Expression of Access Rights.
1.8.4.9 Reliable Messaging
Reliable B2B (Business-to-Business) messages require the following
qualities:
Guaranteed delivery (acknowledgement enforced)
Duplicate removal (using message ID)
Message ordering (using sequence numbers)
SOAP and HTTP partially achieve this as follows:
The first application layer exchanges persistent messages with the
requester.
The requester sends a SOAP message with a message ID, sequence
number, and QoS (Quality of Service) tag to the receiver.
The receiver must send an acknowledgement.
The receiver exchanges persistent messages with the second
application layer.
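A receiver-side sketch (illustrative, not a rendering of any WS-*
specification) of duplicate removal via message IDs and ordering via
sequence numbers; an acknowledgement would be sent only for accepted
messages:

```java
import java.util.HashSet;
import java.util.Set;

public class ReliableReceiver {
    private final Set<String> seenIds = new HashSet<>();
    private long expectedSeq = 0;

    // Returns true if the message should be delivered and acknowledged.
    public boolean accept(String messageId, long seq) {
        if (seenIds.contains(messageId)) return false; // duplicate: drop
        if (seq != expectedSeq) return false; // out of order: buffer or NACK in practice
        seenIds.add(messageId); // remember the ID for duplicate removal
        expectedSeq++;          // advance the expected sequence number
        return true;            // deliver, then send the acknowledgement
    }
}
```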
1.8.4.10 Secure Messages
Digital signatures with XMLDsig
Digital encryption with XMLEnc
WS-Security moves from channel-based security to message
(object)-based security, allowing individual messages
to be signed and encrypted.
WSDL can advertise the QoS expected/provided by a
receiver.
End-to-end security is possible across
intermediaries.
Today, federation (expressed in the WS-Federation
standard) is more important, as seen in the use of OAuth2.
Intermediaries can add signatures or encryption to
the SOAP envelope to create a chain of trust.
New signatures or encryption information is always
prepended to existing information.
No encryption of the envelope, header, or body tag is allowed.
Signatures must respect the right of intermediaries to
change the envelope or some header information.
1.8.4.11 SAML
SAML allows externalization of policies and
mechanisms related to authentication, authorization, and
attribute assertion.
The access control point only needs to check
assertions and does not have to implement these
mechanisms.
SAML allows interchangeability of statements between
different services because the format of the assertions is
fixed.
1.8.4.12 Transaction Models
Atomic transactions:
Are not nested (standalone)
Are short
Involve a tightly coupled business task
Can be rolled back in case of error
Can be disrupted by system crashes
Activity transactions:
Involve nested tasks
Are long-running
Involve a loosely coupled business activity
Include compensating tasks and activities to
address errors
Can be disrupted by errors such as order cancellations
1.8.4.13 Stateful Web Services
Stateful architectures, such as computational grids, require the
concept of a resource.
WS-Resource is a protocol that adds resource information to web
services through metadata descriptions in the WSDL and
WS-Addressing schemas.
An identifier is used to communicate state
information between requestors and endpoints.
Advanced notification requests can be built on top
of WS-Resource.
1.8.4.14 Scaling Web Services
Avoid using XML messaging for fine-grained
RPC, such as requesting the square root of a number or a stock
quote.
Use coarse-grained RPC instead, with web services
that “do a lot of work, and return a lot of information”.
Consider an asynchronous messaging model when the
transport may be slow or unreliable, or the processing is complex or
long-running.
Take the overall system performance into account and don’t
assume that XML’s “bloat” or HTTP’s limitations are a
problem until they are demonstrated in your application.
Consider the frequency of messaging when designing
your web service. A high rate of requests may suggest that you load
(replicate) some data and processing back to the client.
For web services that aggregate data from other web services,
consider performing the aggregation on demand or during
off-hours in one large, coarse-grained transaction.
1.8.4.15 Why UDDI Failed
UDDI relied on the following assumptions:
Central registries of service descriptions
Independent automatic agents searching for
services
Machines understanding service descriptions
Machines deciding on service use
Machines being able to use a service properly
Machines being able to construct advanced workflows
from different services
These assumptions were problematic because:
There is often ambiguity, undefined aspects, and
incompatible terms in service descriptions and interfaces.
It is difficult for machines to properly use and construct advanced
workflows from different services.
To summarize, UDDI lacks technology to address the following
issues:
The meaning of data types and interfaces
The meaning of actions
Risk assessment
Understanding flows and goals
Understanding and matching of constraints
1.8.4.16 Lessons Learned from Web
Services and CORBA
Web services and CORBA are low-level concepts that lack
semantics.
Workflow has not been effectively addressed by web services and
CORBA.
Service-Oriented Architecture (SOA) is not only about interfaces and
interface design, but also about hosting services.
The availability of a service on the web is more
valuable than many specifications and interfaces.
1.8.5 SOA
1.8.5.1 SOA Core Properties
Services offer high-level interfaces that relate to
business functions.
Service choreography (i.e. the coordination of services to achieve a
larger business process) is performed outside the individual
services.
Semantic standards and technologies (such as OWL,
SAML, and the Semantic Web) are used to enable agents to understand
services and their interfaces.
Legacy applications can be made available to other companies through
the use of a service interface.
SOA relies on Web Service technology as its
foundation.
1.8.5.2 SOA Interface Design
Object interfaces can be conversational, accept
transactions, and are fast. They use object references.
Component interfaces use value objects, have a
transaction border, and are generally stateless. They are relatively
fast.
Service interfaces are used for long-running
transactions with state stored in a database. They include compensation
functions and have short process times but long business task execution
times. Service interfaces are isolated, independent, and can be composed
with larger services (choreography) or made up of smaller services
(orchestration). They are stateless.
1.8.5.3 SOA Blueprint Service
Types
Component Services: These are atomic operations on
simple objects, such as database access.
Composite Services: These are atomic operations
that use multiple simple services (orchestration) and are stateless for
the caller.
Workflow Services: These are stateful services with
defined state changes, with the state kept in a persistent store.
Data Services: These provide information
integration through a message-based request and response mechanism.
Pub/Sub Services: These are event-based services
with callbacks and registration.
Service Broker: These are intermediate services
that manipulate and forward messages based on rules.
Compensation Services: These revert actions, but do
not roll back like traditional transactions.
1.8.5.4 SOA vs. Microservices
SOA:
Services are usually larger and more
transactional.
Prefer orchestration using higher level middleware
components.
Enterprise-oriented architecture with middleware
(messaging), service layers, and ownership concepts.
Uses metadata and messaging middleware for contract
decoupling.
Follows a “share as much as possible”
approach.
Big services are transactional.
Microservices:
Smaller and more specialized than SOA
services.
Use eventual consistency technologies, which are
not ACID.
Prefer choreography, which can lead to highly
connected and dependent systems.
Have a simple API gateway and teams own their
infrastructure and business services.
Do not use contract decoupling.
Follow a “share as little as possible”
approach.
1.8.6 REST
1.8.6.1 RPC vs. REST
The web is based on representing resources using
URIs, while web services create private, non-standard ways of
accessing information.
The envelope paradigm used in RPC does not offer any
benefits over the generic HTTP methods (GET, PUT, POST).
RPC mechanisms are not suitable for the
web. Some extensions to the HTTP methods might be necessary to
support certain features, such as tuple-space systems.
1.8.6.2 The Web’s Architecture
Follows a client-server model, where clients
request resources from servers.
Has a uniform interface, with resources identified
using URIs and manipulated through representations.
Messages are self-descriptive, with metadata,
headers, and other information that allows them to be understood and
processed by clients and servers.
Uses hypermedia as the engine of application state
(HATEOAS), meaning that responses to requests include
actionable links that allow clients to navigate the system and access
related resources.
Is a layered system, with intermediaries such as
caching servers, security systems, and load balancers playing important
roles.
Is designed to be cacheable, with responses
indicating whether they can be cached and for how long.
Is stateless, meaning that clients need to provide
context and state information with each request.
Allows for code-on-demand, where servers can send
scripts, applets, and other code to clients as needed.
1.8.6.3 REST Maturity Model
REST Level 0 (RPC): This level resembles regular
RPC, with function calls being made to a single endpoint. Resources are
not accessible and do not have an identity.
REST Level 1 (Resources): Function names and
parameters are turned into resources, and appointments now have an
identity that can be accessed through GET and POST requests.
REST Level 2 (HTTP verbs): It is important to use
the correct HTTP verb (GET, POST, etc.) to indicate the intended action.
GET is idempotent and cacheable. Response codes are
used to indicate the status of the request, such as the creation of a
new resource or a conflict.
REST Level 3 (HATEOAS): Responses include encoded
actions that can be invoked by the client. Services can change their
URIs without breaking clients, and the “rel” attribute is used to
describe the semantics behind the URI link.
1.8.6.4 REST Resource
Archetypes
Document: Fields and links representing a base
resource. Created using POST.
Collections: Containers maintained by the server
with URI generation. Created using POST.
Stores: Container elements maintained by the client
with “put” and without URI generation on the server side. Inserted or
updated using PUT.
Controllers: Procedures accessed using POST.
URI path design: Reflects the resource model, with
variable path segments and query terms.
1.8.6.5 CRUD with REST
In RESTful web services, requests are made by a requestor to
a representation of a resource. The HTTP methods (GET, POST,
PUT, DELETE) are used to perform different actions on the resource:
GET: Reads the resource and does not change the
server state (idempotent).
POST: Creates a new resource on the server.
PUT: Updates an existing resource on the
server.
DELETE: Deletes a resource on the server.
The separation of updates and reads is a principle of good software
design that has been around for a long time. It is known as the
“command-query separation principle” and was made a requirement in the
Eiffel programming language.
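Using Java's built-in HTTP client (Java 11+) as an illustration; the
appointments service and its URLs are hypothetical:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AppointmentsClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String base = "https://api.example.com/appointments";

        // POST: create a new resource; the server assigns its URI.
        HttpRequest create = HttpRequest.newBuilder(URI.create(base))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"slot\":\"10:00\"}"))
                .build();
        System.out.println(client.send(create,
                HttpResponse.BodyHandlers.ofString()).statusCode()); // e.g. 201

        // GET: read; idempotent and does not change server state.
        HttpRequest read = HttpRequest.newBuilder(URI.create(base + "/42"))
                .GET().build();

        // PUT: update an existing resource; DELETE: remove it.
        HttpRequest update = HttpRequest.newBuilder(URI.create(base + "/42"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"slot\":\"11:00\"}"))
                .build();
        HttpRequest delete = HttpRequest.newBuilder(URI.create(base + "/42"))
                .DELETE().build();

        client.send(read, HttpResponse.BodyHandlers.ofString());
        client.send(update, HttpResponse.BodyHandlers.ofString());
        client.send(delete, HttpResponse.BodyHandlers.ofString());
    }
}
```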
1.8.6.6 RESTful Web Features
RESTful web services have four key characteristics:
Use the HTTP protocol in a CRUD-like manner, with
HTTP methods corresponding to different actions on resources.
Are stateless, meaning that the client and server
do not maintain a persistent connection or state information between
requests.
Use meaningful URIs to represent objects and their
relationships in the form of directory entries, with relationships
typically being parent/child or general/specific entity relations.
Use XML or JSON as a transfer format and content
negotiation with mime types.
1.8.6.7 Critical Points with
REST
Delivery of requests: How to handle at-least-once
or at-most-once delivery and transactions.
Security: How to secure requests and delegate
security to backend systems (e.g. bearer tokens).
Performance: How to optimize performance over HTTP,
especially with large amounts of data or many round-trips.
Single responsibility: How to keep the service from
becoming heavy and kludgy and from serving more than a single
responsibility over time.
1.8.6.8 GraphQL as a REST
Alternative
Avoids over- and under-fetching of data by allowing
the client to specify exactly what data it needs in a single
request.
Reduces the number of requests needed by allowing
the client to request all the necessary data in a single request.
Uses a single endpoint with resolvers to handle all
requests.
Uses a uniform syntax for both data and queries,
making it easy to use and understand.
Is typed to prevent mistakes and ensure that the
correct data is returned.
Supports federated servers, allowing multiple
servers to be combined and accessed through a single endpoint.
Has the potential for huge queries, which can be a
danger if not carefully managed.
1.8.7 Microservices
1.8.7.1 Reasons for Microservices
Unicorn companies (successful startups) need to develop new
features quickly with independent teams.
Unicorn companies need to deploy new features
quickly due to competition and the need for
experimentation.
Unicorn companies need to offer an API for network
effects.
1.8.7.2 Scalability Problems of
Monolithic Applications
Monolithic applications can only be deployed as a
whole, making it difficult to scale specific components.
The central database of a monolithic application
can be hard to scale.
The API of a monolithic application can be
hard to scale.
Developers are dependent on the general release
plan for the entire application, which can be inflexible and
slow.
1.8.7.3 Scalability Benefits of
Microservices
Individual functions can be deployed independently,
making it easy to perform A/B testing.
Independent teams and releases are possible,
allowing for more flexibility and faster development.
Databases are often independent due to sharding,
making it easy to scale them.
The API can be quickly scaled to
meet demand.
1.8.7.4 The Microservice
Ecosystem
Continuous integration/deployment allows for rapid
experimentation.
Fully automated build and deploy processes ensure
efficient development and deployment.
A variety of programming languages can be
used.
Continuous monitoring tools, such as ELK, help
identify and troubleshoot issues.
Both REST APIs and RPC tools can be used to
communicate between microservices.
Containers are preferred over virtual machines
(VMs) because they are more lightweight and efficient.
DevOps teams are responsible for operations.
Site-reliability engineers (SREs) oversee the
performance and reliability of the system.
Distributed transactions are avoided.
Federated security ensures the security of data and
communication across multiple microservices.
Fault-tolerance patterns ensure that the system can
continue to operate even if individual microservices fail.
1.8.7.5 Microservice Design
Patterns
API gateway: Facade to fine-granular services
Client-side discovery: Provided by the microservice
chassis (e.g. Spring Boot)
Server-side discovery: Services register themselves
during startup, or the chassis performs the self-registration on their
behalf
Service instance per container: Scales better than
VM per service
Serverless deployment: Use of functions as a
service (FaaS) to run small pieces of code in response to specific
events
Database per service: Each microservice has its own
database, with no direct access to the databases of other
microservices
Event-driven architecture: Programming without a
stack, with changes triggered by events
Event sourcing: Record change events in an event
store (see the sketch after this list)
CQRS: Separate update and idempotent read
operations
Transaction log tailing: Follow transaction log for
changes
Database triggers: Put events in events table after
changes
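A toy sketch of the event-sourcing pattern from the list above
(illustrative; a real event store would be a separate, persistent,
append-only service):

```java
import java.util.ArrayList;
import java.util.List;

public class EventSourcedAccount {
    record Deposited(long amount) { } // a recorded change event

    private final List<Deposited> eventStore = new ArrayList<>();

    public void deposit(long amount) {
        eventStore.add(new Deposited(amount)); // append, never update in place
    }

    // CQRS query side: an idempotent read that replays the events.
    public long balance() {
        return eventStore.stream().mapToLong(Deposited::amount).sum();
    }

    public static void main(String[] args) {
        EventSourcedAccount a = new EventSourcedAccount();
        a.deposit(100);
        a.deposit(50);
        System.out.println(a.balance()); // 150, derived by replay
    }
}
```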
1.8.8 Critical Points with
Microservices
Cross-concerns, such as transactions, security,
performance, and scalability, can be challenging to
manage in a microservice system. Site-reliability engineers (SREs) may
be responsible for addressing these issues.
Virtual machines (VMs) may be too large for small services with low
throughput.
The maintenance of many microservices can
be time-consuming and complex.
Monitoring can be challenging due to the large
number of independent services and the difficulty in identifying
correlations between them.
Different languages and technologies may cause
confusion and fragmentation.
There is a risk of creating new central
bottlenecks, such as an event store.
REST may not always be the best API model and
transport for microservices.
Distributed commits must be eventually consistent
to ensure that they do not get lost.
1.8.9 Serverless
1.8.9.1 Serverless Definition
Cloud-native platform for short-running, stateless
computation and event-driven applications
Scales up and down instantly and automatically
Charges for actual usage at a millisecond
granularity
Decouples computation and storage, which scale and are
priced separately
Users pay for code execution instead of allocated
resources
Allows developers to focus on writing code, not
infrastructure management
1.8.9.2 Stateless Applications
Stateless system or component does not maintain state or
context between requests
Each request is treated as an independent action,
not influenced by previous requests
Configuration information may be stored on the classpath or in
persistent storage
A function has no memory of previous invocations
Change management may be a challenge in stateless
systems
Hypercomposition (breaking a system down into
smaller, independent components) may be difficult in
stateless systems
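As a sketch of such a stateless function, here is the shape of an AWS
Lambda handler in Java (the class and its behavior are illustrative):

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

// Each invocation is independent: nothing stored in fields can be
// relied on to survive between requests, so any durable state must
// live in external storage.
public class ThumbnailHandler implements RequestHandler<String, String> {
    @Override
    public String handleRequest(String input, Context context) {
        return "processed: " + input; // short-running, stateless computation
    }
}
```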
1.8.9.3 Issues with Serverless
Countless small IAM rules
Coupling with less scalable components
Slow cold starts
Unclear bug handling
Stateful functions
Limits everywhere
New testing concepts needed
High costs when waiting for something
Unpredictable latency due to the dynamic nature of
the environment
Lack of direct addressability of functions
Limited support for common software patterns, such
as batch processing
Difficulty debugging, tracing, and monitoring
functions
Inefficient storage systems for small objects or
frequent access
1.8.9.4 Serverless Design
Patterns
Scalable Webhook: Triggers based on external
events
Gatekeeper: Controls access to internal
resources
Internal API: Provides an interface to internal
resources
Internal Handoff: Communicates between serverless
functions
Aggregator: Gathers and processes data from
multiple sources
Notifier: Sends notifications to external systems
or users
FIFOer: Ensures first-in, first-out processing of
events
Streamer: Processes data from a stream in
real-time
Router: Routes events to the appropriate
destination
State Machine: Executes a series of steps based on
the current state of an event
1.9 Theoretical Foundations of
Distributed Systems
1.9.1 Overview
Basic Concepts
Distributed Systems Fallacies
Latency
Correctness and Liveness
Time, Ordering and Failures
The Impossibility of Consensus in Async. DS (FLP)
The CAP Theorem
Failures, Failure Types and Failure Detectors
Time in Distributed Systems
Ordering and Causality
Algorithms for Consensus in DS
Optimistic Replication and Eventual Consistency
1.9.2 Foundational Concepts
1.9.2.1 The Eight Fallacies of
Distributed Computing
The network is reliable: This fallacy assumes that
the network will always be available and free of errors or failures,
which is not always the case.
Latency is zero: This fallacy assumes that
communication between nodes in a distributed system will be
instantaneous, but in reality, there is always some latency due to the
time it takes for a request to be processed and a response to be
received.
Bandwidth is infinite: This fallacy assumes that
there is unlimited bandwidth available for communication between nodes,
but in reality, bandwidth can be limited by factors such as network
congestion or hardware limitations.
The network is secure: This fallacy assumes that
the network is completely secure and free from threats such as hackers
or malicious software, but this is not always the case.
Topology doesn’t change: This fallacy assumes that
the topology of the network, or the way that nodes are connected, will
remain constant, but in reality, the topology can change due to factors
such as node failures or changes in network configuration.
There is one administrator: This fallacy assumes
that there is only one person or entity responsible for managing the
network, but in distributed systems, there may be multiple
administrators or stakeholders with different roles and
responsibilities.
Transport cost is zero: This fallacy assumes that
there are no costs associated with communication between nodes, but in
reality, there may be costs such as network fees or hardware
expenses.
The network is homogeneous: This fallacy assumes
that all nodes in the network are identical and operate in the same way,
but in reality, nodes can have different hardware, software, and
configurations.
1.9.2.2 Analyzing Latency
Know the long-term trends in hardware: Latency can
be influenced by the performance of the hardware being used, so it’s
important to be aware of trends in hardware development and how they may
impact latency.
Understand the problem of deep queuing networks and the
solutions: Deep queuing occurs when there are many requests
waiting to be processed, leading to longer latency. Understanding this
problem and implementing solutions such as load balancing can help
reduce latency.
Know your numbers with respect to switching times, router
delays, round-trip times, IOPS per device, and perform “back of the
envelope” calculations: It’s important to have a good
understanding of the specific numbers and metrics related to latency in
your system, such as switching times and router delays, and to perform
calculations to estimate the impact of these factors on latency.
Understand buffering effects on latency: Buffering,
or the temporary storage of data, can impact latency by adding
additional processing time. Understanding the effects of buffering on
latency can help you optimize your system to minimize this impact.
Include the client side in your calculations:
Latency is often impacted by factors on the client side, such as the
client’s hardware and network connection. It’s important to consider
these factors when calculating latency and optimizing your system to
minimize it.
1.9.2.3 Characteristics of
Distributed Systems
High complexity: Distributed systems involve a
large number of interacting agents, such as servers, clients, and
network devices, which can make them complex to design, build, and
maintain.
Partial knowledge: In distributed systems, each
node or agent typically has only partial knowledge about the state of
the system, the actions of other nodes, and the current time. This can
make it difficult to coordinate actions and ensure consistency across
the system.
Uncertainty: Distributed systems are prone to
uncertainty due to factors such as node failures, network delays, and
changes in the system’s environment. This uncertainty can make it
challenging to predict the behavior of the system and ensure its
reliability.
1.9.3 Consistency
1.9.3.1 Liveness
vs. Correctness
Correctness and liveness are two important properties of distributed
systems that ensure they function as intended and make progress.
Correctness refers to the property that a system
will not exhibit undesirable behaviors, such as producing incorrect
results. It can be thought of as the
absence of bad things happening in the system.
Liveness, on the other hand, refers to the
property that ensures that a system will eventually make progress and
achieve its intended goals. It can be thought of as the presence
of good things happening in the system.
Both correctness and liveness are based on
assumptions about failures and other conditions in the system,
such as fairness and the presence of Byzantine errors. Ensuring that a
distributed system has both correctness and liveness is critical for its
success.
1.9.3.2 Liveness and Correctness in
Practice
Here is an example of how correctness and liveness can be defined for
an event-based system:
Correctness:
Receive notifications only if subscribed to them:
This ensures that a node only receives notifications for events it is
interested in.
Received notifications must have been published
before: This ensures that notifications are not received before
they have been published, which would lead to incorrect behavior.
Receive a notification only at most once: This
ensures that a node does not receive the same notification multiple
times, which could lead to incorrect behavior.
Liveness:
Start receiving notifications some time after a subscription
was made: This ensures that a node will eventually start
receiving notifications after it has subscribed to them.
Failure Assumptions: Fail-stop model with
fairness
In this example, the system is designed to ensure correctness by
limiting the notifications a node receives to those it is subscribed to
and ensuring that notifications are received only once. It is designed
to ensure liveness by ensuring that a node will eventually start
receiving notifications after subscribing. The system also makes
assumptions about failures, such as the fail-stop model with fairness,
which are used to ensure the correctness and liveness of the system.
1.9.3.3 Timing Models
In distributed systems, timing models refer to the way that events
and communication between nodes are synchronized. There are three main
types of timing models: synchronous, asynchronous, and partial
synchronous.
Synchronous timing models: In synchronous timing
models, transmit times are strictly defined and events happen at
specific moments. Nodes in a synchronous system can immediately detect
when another node has crashed because the system relies on a clock to
synchronize events. Examples of synchronous systems include CPUs and
other types of hardware.
Asynchronous timing models: In asynchronous timing
models, there is no exact time between sending and receiving messages.
Messages will “eventually” arrive, but there is no guarantee about when.
Because there are no strict timing constraints, a node in an
asynchronous system cannot tell whether another node has crashed or is
simply very slow to respond. There are no timeouts in asynchronous
systems because they would require a clock.
Partial synchronous timing models: Partial
synchronous timing models are a combination of synchronous and
asynchronous models. These systems are asynchronous, but they are
enhanced with local clocks that provide some level of synchronization.
This is the model that is typically used for real-world distributed
systems, as it allows for the flexibility of asynchronous communication
while still providing some guarantees about timing.
1.9.3.4 The Fischer, Lynch and
Paterson Result
The FLP (Fischer, Lynch, and Paterson) result is a theoretical
result showing that in an asynchronous distributed system, no
deterministic consensus protocol can be guaranteed to terminate if even
a single process may crash.
The result has significant implications for
distributed algorithms that rely on consensus, such as leader election,
agreement, replication, locking, and atomic broadcast.
The problem is caused by the need for a unique
leader to make a decision, but the asynchronous nature of the
system can lead to delays and the re-election of new leaders, which can
indefinitely delay the decision-making process.
This problem affects most consensus-based distributed algorithms and
can result in non-terminating runs where no decision is reached.
1.9.4 The CAP Theorem
1.9.4.1 Overview
States that in the presence of network partitions, a client must
choose either consistency or availability, but not both.
Choosing consistency: If the client chooses
consistency, they may not get an answer at all.
Choosing availability: If the client chooses
availability, they may receive a potentially incorrect answer.
1.9.4.2 Preconditions for the CAP
Theorem
To be considered consistent, the system must ensure that all
operations are atomic and linearizable, meaning that they can
be thought of as occurring at a single instant in time and have a total
order.
For a system to be considered available, it must ensure that
every request received by a non-failing node is met
with a response, and that every request terminates.
Partition tolerance: The network will be allowed to lose
arbitrarily many messages sent from one node to another.
1.9.4.3 Common Misconceptions of
the CAP Theorem
Consistency: Many systems do not achieve a total
order of requests due to the costs (latency, partial results)
involved.
Availability: Even an isolated node must answer
requests, breaking consistency, although a working quorum may exist on
the other side of the partition; the isolated node cannot know whether
such a quorum exists.
Partition Tolerance: You cannot un-choose partition
tolerance, as it is always present. CA systems are therefore not
possible.
1.9.4.4 The Modern View of the CAP
Theorem
There are more failure types than just network
partitions, such as host crashes and client-server disconnects.
These failures cannot be completely avoided.
Many systems do not need linearizability, and it is
important to carefully consider the type of consistency that is
needed.
Most systems prioritize latency over consistency,
with availability coming in second.
A fully consistent system in an asynchronous
network is impossible (in the sense of FLP). FLP is
much stronger than CAP.
The architecture of the system (replication,
sharding) and the abilities of the client (failover)
also have an impact on the system’s behavior.
1.9.4.5 PACELC
PACELC: A more complete portrayal of the space of
potential consistency tradeoffs for distributed database systems.
In the presence of a partition (P), the system must trade off
availability (A) and consistency (C).
Else (E), i.e. in the absence of a partition and while running
normally, the system can trade off latency (L) and consistency (C).
1.9.5 Failure
1.9.5.1 Technical Failures
Network failures, such as partitioning, which can
occur when a network is divided into smaller, separate networks that are
unable to communicate with each other.
CPU/Hardware failures, such as instruction failures
or RAM failures, which can occur when the hardware components of a
system malfunction or fail.
Operating system failures, such as crashes or
reduced function, which can occur when the operating system experiences
an error or malfunction.
Application failures, such as crashes, stopped
states, or partially functioning states, which can occur when an
application experiences an error or malfunction.
Unfortunately, in most cases there is no failure detection service
that can identify when these failures occur and allow others to take
appropriate action. However, it is possible to develop a failure
detection service that could detect even partitioning and other
types of failures through the use of MIBs (Management Information Bases)
and triangulation. Applications could also be designed to track
themselves and restart if necessary.
1.9.5.2 Failure Types
Bohr-Bug:
Shows up consistently and can be reproduced. Easy
to recognize and fix.
Heisenbug:
Shows up intermittently, depending on the order of
execution.
High degree of non-determinism and context
dependency.
Due to complex IT environments, they are both more frequent and
harder to solve.
They are only symptoms of a deeper problem.
Changes to software may make them disappear temporarily, but more
changes can cause them to reappear.
Example: Deadlock “solving” through delays instead of resource order
management.
1.9.5.3 Failure Models
Crash-stop: A process crashes atomically and stays
down.
Crash-stop with recovery: A process crashes and is
down until it begins recovery, and is up again until the next crash
occurs. For consensus, at least 2f+1 machines are needed
(quorum).
Crash-amnesia: A process crashes and restarts
without recollection of previous events or data.
Failstop: A machine fails completely, and the
failure is reported to other machines reliably.
Omission errors: Processes fail to send or receive
messages even though they are alive.
Byzantine errors: Machines or parts of machines,
networks, or applications fail in unpredictable ways and may recover
partially. For consensus, at least 3f+1 machines are
needed.
Many protocols for achieving consistency and availability
make assumptions about failure models. For example,
transaction protocols may assume crash-recovery behavior from their
participants so that the protocol can terminate.
1.9.5.4 Failures and Timeouts
Timeouts are not a reliable way to detect failures
in distributed systems because they can be caused by various factors,
such as short interruptions on the network, overload conditions, and
routing changes.
Timeouts cannot distinguish between different types
and locations of failures.
Timeouts cannot be used in protocols that require
failstop behavior of their participants.
Most distributed systems only offer timeouts for applications to
notice problems, so they do not provide detailed information about the
state of participants or membership.
Using timeouts can result in “split-brain”
conditions, where a system behaves as if it is functioning
properly but is actually experiencing a failure or malfunction.
1.9.5.5 Failure Detectors
A failure detector (FD) is a mechanism used in distributed systems to
detect failures of processes or machines. It is not required to be
correct all the time, but it should provide the following
quality of service (QoS):
Safety: The FD should be safe all the time, meaning
it should not falsely suspect processes of being faulty during “better”
failure periods.
Liveness: The FD should be live during “better”
failure periods, meaning no process should block forever waiting for a
message from a dead coordinator.
Accuracy: Eventually, some process x should not be
falsely suspected of being faulty. When x becomes the coordinator, every
process should receive x’s estimate and make a decision based on
it.
Low overhead: The FD should not cause a lot of
overhead, meaning it should not consume too many resources or slow down
the system.
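A minimal sketch of a heartbeat-based failure detector along these lines; the timeout value and the API are assumptions for illustration, not a definitive implementation:

    import time

    class HeartbeatFailureDetector:
        # Suspects a process once no heartbeat arrived within `timeout` seconds.
        def __init__(self, timeout=2.0):
            self.timeout = timeout
            self.last_seen = {}  # process id -> time of last heartbeat

        def heartbeat(self, pid):
            # Called whenever a heartbeat message from `pid` arrives.
            self.last_seen[pid] = time.monotonic()

        def suspected(self, pid):
            # Suspect a process we have not heard from recently.
            last = self.last_seen.get(pid)
            return last is None or time.monotonic() - last > self.timeout

Such a detector is only eventually accurate: a slow network alone can cause false suspicions, which is exactly why the QoS properties above are formulated over “better” failure periods.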
1.9.6 Clocks
1.9.6.1 Time in Distributed
Systems
In distributed systems, there is no global time that
is shared across all processes. Instead, different approaches are used
to model time in these systems. These approaches include:
Event clock time: This is a logical model of time
that represents the order of events within a single process.
Vector clock time: This is a logical model of time
that represents the order of events between multiple processes.
TrueTime: This is a physical model of time that
represents the interval of time between events.
Augmented time: This is a combination of physical
and logical models of time that takes into account both the interval of
time between events and the order of events.
Logical time is modeled as partially ordered events within a process
or between processes. It is used to represent the relative order
of events in a distributed system, rather than the
actual clock time at which they occurred.
Causal meta-data in the system can also order the events
properly.
1.9.6.2 Consistent vs. Inconsistent
Cuts
Consistent cuts: These cuts produce causally
possible events, meaning that events occur in a logical and
possible order.
Inconsistent cuts: These cuts produce events that
arrive before they have been sent, resulting in an illogical or
impossible order.
1.9.6.3 Event Clocks (Logical
Clocks)
Event clocks, also known as logical clocks, are systems for
ordering events within processes according to a chosen causal
model and granularity.
Events are partially ordered based on the order in which
they occur. The time between events is a logical unit of time
that has no physical extension.
Events delivered through messages can be used to relate the
processes and their times to the events. The external order of these
events is also a partial order of events between processes (for example,
the event “send(p1,m)” occurs before the event “recv(p2,m)”).
On receiving a message, the value of the logical clock is
updated to the maximum of the current value and the received
timestamp, plus one.
1.9.6.4 Lamport Logical Clock
The Lamport logical clock counts events and creates an
ordering relation between them. These counters can be used as
timestamps on events.
The ordering relation captures all causally related events, but it
also includes many unrelated (concurrent) events, which can
create false dependencies.
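A minimal sketch of a Lamport clock (class and method names are illustrative assumptions):

    class LamportClock:
        def __init__(self):
            self.time = 0

        def tick(self):
            # Local event or send: increment the counter.
            self.time += 1
            return self.time

        def receive(self, msg_time):
            # Merge rule from above: max of local and received value, plus one.
            self.time = max(self.time, msg_time) + 1
            return self.time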
1.9.6.5 Vector Clocks
Vector clocks are transmitted with messages and
compared at the receiving end.
If, for all positions in two vector clocks A and B, the values in A
are larger than or the same as the values from B, we say that
Vector Clock A dominates B.
This can be interpreted as potential causality to
detect conflicts, as missed messages to order propagation, etc.
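A small sketch of the dominance test, with vector clocks represented as dicts from process id to counter (a representation chosen for illustration):

    def dominates(a, b):
        # A dominates B if every entry of A is >= the matching entry of B.
        return all(a.get(p, 0) >= v for p, v in b.items())

    def concurrent(a, b):
        # Neither clock dominates the other: the events are potentially
        # concurrent, which is how conflicts are detected.
        return not dominates(a, b) and not dominates(b, a)

    assert dominates({"p1": 2, "p2": 3}, {"p1": 1, "p2": 3})
    assert concurrent({"p1": 2, "p2": 1}, {"p1": 1, "p2": 3})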
1.9.6.6 Physical Interval Time
(TrueTime)
TrueTime works by using time servers to check for
rogue clocks, which are clocks that are not synchronized with the
correct time.
The error in TrueTime is typically in the range of 6
milliseconds.
1.9.6.7 Hybrid Clocks
Hybrid clocks are systems that combine elements of both
logical and physical models of time in order to address the
limitations of each approach. There are several reasons why hybrid
clocks may be used in distributed systems:
In large distributed systems, such as those spanning multiple data
centers across the world, vector clocks can become too
large to maintain efficiently. Hybrid clocks can be used to
reduce the size of the clocks while still maintaining an accurate
ordering of events.
Physical interval time, such as TrueTime, requires
that reads and writes wait until the interval time is over on
all machines. This can be inefficient in some cases, and hybrid
clocks can be used to allow for more flexibility in terms of when reads
and writes can occur.
1.9.6.8 Ordering in Distributed
Event Systems
FIFO (first-in, first-out) ordering: This refers to
the requirement that a component must receive notifications in the order
in which they were published by the publisher.
Causal ordering: This refers to the requirement
that events must be ordered based on their causal relationships, as
defined by the system.
Total ordering: This refers to the requirement that
events must be ordered in a specific way, such that no other component
in the system is allowed to receive an event before a preceding event
has been received. One component may be responsible for deciding the
global order of events in this case.
1.9.7 Consensus
1.9.7.1 Overview
Consensus is a process used by a group of processes
to reach agreement on a specific value, based on their
individual inputs. The objective of consensus is for all processes to
decide on a value v that is present in the input set.
Termination: Every correct process eventually
decides on some value.
Validity: If a process decides on a value v, then v
was proposed by some process.
Integrity: No process decides on a value more than
once.
Agreement: No two correct processes decide on
different values.
1.9.7.2 Algorithms for
Consensus
These protocols offer trade-offs in terms of
correctness, liveness (availability and progress), and performance:
Two-Phase Commit (2PC): This algorithm is used to
ensure that a group of processes all commit to the same decision. It
involves two phases: a prepare phase, in which the processes prepare to
commit to a decision, and a commit phase, in which they actually commit
to the decision.
Static Membership Quorum: This algorithm is based
on the concept of quorum, which refers to a minimum number of processes
that must be present in order to reach a decision. The static membership
quorum algorithm is used to achieve consensus in systems with a fixed
number of processes.
Paxos: This algorithm is used to achieve consensus
in distributed systems with a dynamic membership. It involves multiple
rounds of voting in order to reach a decision.
Raft: This algorithm is similar to Paxos, but it is
designed to be easier to understand and implement.
Dynamic Group Membership: This class of algorithms
is used to achieve consensus in systems with a dynamic membership. These
algorithms include virtual synchrony and multicast-based
approaches.
Gossip Protocols: These algorithms are used to
disseminate information between processes in a distributed system. They
can be used to achieve consensus by allowing processes to exchange
information and reach agreement on a decision.
1.9.7.3 Two-Phase Commit (2PC)
Steps:
Preparation phase: In this phase, the processes
prepare to commit to a decision. Each process sends a “prepare to
commit” message to a coordinator process, which is responsible for
coordinating the decision-making process.
Decision phase: Once all the processes have
prepared to commit, the coordinator process sends a “commit” message to
all the processes. This message instructs the processes to commit to the
decision.
Confirmation phase: Each process sends a “commit
confirmation” message to the coordinator process, indicating that it has
successfully committed to the decision.
Finalization phase: Once all the processes have
confirmed that they have committed to the decision, the coordinator
process sends a “commit complete” message to all the processes,
indicating that the decision has been successfully made.
Example:
Imagine that there are three processes in a distributed system: A,
B, and C.
The coordinator process is A.
The processes are deciding whether to commit to a new software
update.
In the preparation phase, A sends a “prepare to
commit” message to B and C.
B and C send “prepare to commit” messages back to A.
In the decision phase, A sends a “commit” message
to B and C.
In the confirmation phase, B and C send “commit
confirmation” messages back to A.
In the finalization phase, A sends a “commit
complete” message to B and C, indicating that the decision to commit to
the software update has been successfully made.
Liveness and Correctness:
The two-phase commit protocol (2PC) allows for atomic
(linearizable) updates to be made by participating
processes.
2PC is considered relatively expensive in terms of
latency due to the requirement for a persistent log at each
participant.
A crash failure of the coordinator or a participant
can halt or disrupt the execution of 2PC.
If everything goes as planned, 2PC has a clear and
easy-to-understand semantic for application developers.
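A minimal sketch of the coordinator side, assuming participant stubs with hypothetical prepare()/commit()/rollback() calls; real implementations additionally log every step to persistent storage:

    def two_phase_commit(participants):
        # Phase 1: collect votes; a single "no" aborts the transaction.
        for p in participants:
            if not p.prepare():
                for q in participants:
                    q.rollback()
                return False
        # Phase 2: everyone voted yes, so instruct all to commit.
        for p in participants:
            p.commit()
        return True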
1.9.7.4 Quorum-Based Consensus
Steps:
A decision is proposed by one of the processes in
the distributed system.
Each process in the system votes
on the proposed decision.
The votes are counted and checked against the
quorum requirement. The quorum requirement is the minimum number of
votes that must be received in favor of the decision in order for it to
be approved.
If the quorum requirement has been met, the
decision is considered to have been approved.
The processes move forward with implementing the
decision.
Example:
A group of five processes in a distributed system (A, B, C, D, and
E) are deciding whether to commit to a new software
update.
Process A proposes the decision to update the
software.
Processes B, C, D, and E all vote on the proposed
decision.
The votes are counted and checked against the quorum
requirement, which is set at three votes.
Since a quorum of three votes has been received in favor of the
decision, it is considered to have been approved.
The processes move forward with implementing the software
update.
Liveness and Correctness:
Quorum-based consensus protocols (QP) allow for atomic
(linearizable) updates to be made by participating processes,
but this depends on the type of implementation being used (e.g., whether
read quorums are used).
QP are known to be relatively expensive in terms of
latency, as even simple reads may require a quorum for
consistency. To improve performance, many QP systems use a
leader-based approach, which involves routing client requests
through a designated leader process.
Leader crashes are usually handled through
the use of leases, which can cause delays.
In the case of a crash failure, QP can continue as long as a
quorum is still possible.
Network partitions may cause the system to either
respond with all non-failing nodes (which may sacrifice
consistency) or to stop responding to requests from minority
nodes (which may sacrifice availability).
QP offer a reliable and efficient method for
achieving consensus in distributed systems, but they also come with some
trade-offs in terms of performance and fault tolerance.
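The counting step from the example above as a short sketch (the function name is an assumption):

    def quorum_approved(votes, n):
        # A proposal passes once more than half of the n processes agree.
        quorum = n // 2 + 1
        return sum(1 for v in votes if v) >= quorum

    # Five processes, three yes votes: the quorum of three is met.
    assert quorum_approved([True, True, True, False, False], n=5)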
1.9.7.5 Paxos
Steps:
Prepare (Phase 1a):
The Proposer (also known as the leader) selects a
proposal number N and sends a Prepare message to a
quorum of Acceptors.
Promise (Phase 1b):
If the proposal number N is larger than any previous proposal, the
Acceptors promise not to accept proposals less than N
and send the value they last accepted for this instance to the
Proposer.
If the proposal number is not larger, a denial is
sent.
Accept! (Phase 2a):
If the Proposer receives responses from a quorum of
Acceptors, it may choose a value to be agreed
upon.
If any of the Acceptors have already accepted a value, the leader
must choose a value from this set. Otherwise, the Proposer is free to
propose any value.
The Proposer sends an Accept! message to a quorum
of Acceptors with the chosen value.
Accepted (Phase 2b):
If the Acceptor receives an Accept! message for a
proposal it has promised, it accepts the value and sends an
Accepted message to the Proposer and every Learner.
Example:
Let’s say there are 5 replicas in the Paxos system: A, B, C, D, and
E. A client wants to update the value of a shared key-value store from
“apple” to “banana”.
Prepare (Phase 1a):
The Proposer (also known as the leader) selects a proposal number N
and sends a Prepare message to a quorum of Acceptors (in this case, a
quorum is defined as 3 or more replicas).
The message includes the proposal number N; the value “banana” that
the client wants to write is proposed later, in Phase 2a.
The Acceptors respond to the Prepare message with either a Promise
message or a Deny message.
Promise (Phase 1b):
If an Acceptor receives a Prepare message with a proposal number N
that is higher than any previous proposal number it has seen, it sends a
Promise message to the Proposer and includes the value it last accepted
for this instance (if it has accepted a value in the past).
If the proposal number N is not higher than any previous proposal
number, the Acceptor sends a Deny message.
Accept! (Phase 2a):
If the Proposer receives Promise messages from a quorum of
Acceptors, it may choose a value to be agreed upon.
If any of the Acceptors have already accepted a value, the Proposer
must choose a value from this set. Otherwise, the Proposer is free to
propose any value.
The Proposer sends an Accept! message to a quorum of Acceptors with
the chosen value.
Accepted (Phase 2b):
If an Acceptor receives an Accept! message for a proposal it has
promised, it accepts the value and sends an Accepted message to the
Proposer and every Learner.
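A minimal sketch of the Acceptor side of both phases; the state layout and return values are assumptions for illustration:

    class Acceptor:
        def __init__(self):
            self.promised = 0      # highest proposal number promised so far
            self.accepted = None   # (number, value) last accepted, if any

        def prepare(self, n):
            # Phase 1b: promise to ignore proposals below n and report
            # the last accepted value back to the Proposer.
            if n > self.promised:
                self.promised = n
                return ("promise", self.accepted)
            return ("deny",)

        def accept(self, n, value):
            # Phase 2b: accept unless a higher proposal was promised.
            if n >= self.promised:
                self.promised = n
                self.accepted = (n, value)
                return ("accepted", n, value)
            return ("deny",)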
1.9.7.6 Raft
RAFT is a distributed consensus protocol that allows a group of
processes (called “replicas”) to agree on a value (“decide”) in the
presence of failures. RAFT is divided into three distinct roles:
Leader, Follower, and Candidate.
The protocol consists of the following steps:
Leader Election:
When a replica starts up or its leader fails, it becomes a Candidate
and initiates an election by sending RequestVote
messages to all other replicas.
If a Follower receives a RequestVote message from a Candidate with a
higher term, it responds with its vote and updates its term to match the
Candidate’s term.
If a Candidate receives a quorum of votes (more
than half of the replicas), it becomes the Leader and
sends AppendEntries messages to all other replicas to replicate its
log.
Log Replication:
The Leader sends AppendEntries messages to all
other replicas to replicate its log.
If a Follower’s log is missing an entry preceding
the one in the Leader’s message, it responds with a missing
entry error.
If the Follower’s log is up-to-date and the Leader’s entry is valid,
the Follower appends the entry and responds
with a success message.
If the Leader receives a success message from a
quorum of replicas, it updates its commit index and sends a
Commit message to all other replicas to apply the committed
entry to their state machines.
State Machine Update:
If a replica receives a Commit message, it applies
the committed entry to its state machine and responds with an
Apply message to the Leader.
If the Leader receives an Apply message from a
quorum of replicas, it updates its commit index and sends an
Apply message to all other replicas.
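The vote-counting core of leader election as a sketch, assuming peer stubs with a hypothetical request_vote() RPC; real Raft also uses randomized election timeouts:

    class Candidate:
        def __init__(self, node_id, term, peers):
            self.node_id = node_id
            self.term = term
            self.peers = peers   # stubs for all other replicas

        def run_election(self):
            self.term += 1
            votes = 1            # a Candidate votes for itself
            for peer in self.peers:
                if peer.request_vote(self.term, self.node_id):
                    votes += 1
            # Leader only with a quorum: more than half of the cluster.
            return votes > (len(self.peers) + 1) // 2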
1.9.8 Broadcasting
1.9.8.1 Atomic Broadcast
Conditions
A distributed algorithm that guarantees correct
transmission of a message from a primary process to all other
processes in a network or broadcast domain, including the primary.
It satisfies the following conditions:
Validity: If a correct process broadcasts a
message, then all correct processes will eventually deliver it
Uniform Agreement: If a process delivers a message,
then all correct processes eventually deliver that message
Uniform Integrity: For any message m, every process
delivers m at most once, and only if m was previously broadcast by the
sender of m
Uniform Total Order: If processes p and q both
deliver messages m and m′, then p delivers m before m′ if and only if q
delivers m before m′
It is widely used in distributed computing for group communication
and defined as a reliable broadcast that satisfies total
order.
1.9.8.2 Atomic Broadcast
Protocol
Data:
Epoch (e): The duration of a specific
leadership
View (v): Defined membership set that lasts until
an existing member leaves or comes back
Transaction counter (tc): Counts rounds of
execution, such as updates to replicas
Phases:
Leader election/discovery: Members decide
on a new leader and form a consistent view of the group.
Synchronization/recovery: Leader gathers
outstanding, uncommitted requests recorded at members and
updates members missing certain data until all share the same
state.
Working: Leader proposes new
transactions to the group, collects confirmations, and sends
out commits.
Frequent leader changes can cause overhead and may
be a potential denial of service. It is important to consider latency on
the leader node when implementing an atomic broadcast protocol.
Paxos, Raft etc. are Atomic Broadcast
Protocols!
1.9.8.3 Gossip Protocols
Gossip protocols are a class of distributed algorithms that
rely on randomly chosen pairs of nodes in a network to
exchange information about the state of the system. They are typically
used for group membership, failure detection, and dissemination of
information.
There are several key characteristics of gossip protocols:
Randomized: Gossip protocols rely on randomly
chosen pairs of nodes to exchange information, which helps to reduce the
risk of overloading any particular node.
Scalability: Gossip protocols scale well in large,
distributed systems because they only require communication with a few
nodes at a time.
Fault tolerance: Gossip protocols are designed to
tolerate failures and can continue to operate even if some nodes go
down.
Asynchronous: Gossip protocols do not rely on a
central authority or global clock, so they can operate asynchronously in
a distributed system.
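A minimal push-gossip round as a sketch; the per-node state layout (key mapped to a (version, value) pair) is an assumption for illustration:

    import random

    def gossip_round(nodes, state):
        # Every node pushes its state to one randomly chosen peer.
        for node in nodes:
            peer = random.choice([n for n in nodes if n is not node])
            # Merge by keeping the newest version of every key.
            for key, (version, value) in state[node].items():
                seen = state[peer].get(key)
                if seen is None or seen[0] < version:
                    state[peer][key] = (version, value)

Repeated rounds spread an update to all nodes with high probability in O(log n) rounds, which is what makes the approach scale.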
1.9.8.4 DWAL
A DWAL (Distributed Write-Ahead-Log) is a data structure that is used
to ensure that updates to a distributed system are stored in a way that
allows them to be recovered in case of system failure. It is a type of
write-ahead log, which means that updates are written to the log before
they are applied to the system’s state. This allows the updates to be
replayed in the correct order after a system failure.
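A minimal single-node write-ahead log sketch (the file format and field names are assumptions); a DWAL additionally runs the appends through a consensus protocol:

    import json

    def append(log_path, update):
        # Write and flush the update to the log *before* applying it.
        with open(log_path, "a") as log:
            log.write(json.dumps(update) + "\n")
            log.flush()

    def replay(log_path, state):
        # After a crash, re-apply all logged updates in original order.
        with open(log_path) as log:
            for line in log:
                update = json.loads(line)
                state[update["key"]] = update["value"]
        return state

Production logs would also fsync and checksum entries before acknowledging a write.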
1.9.8.5 Design Components of
DWALs
Global visibility: Replicated state should
be visible to all processes in the system. This can be achieved
through the use of atomic broadcast or other consensus protocols to
ensure that all processes have a consistent view of the system
state.
Consensus protocol: A consensus protocol such as
Paxos or Raft is used to ensure that all processes agree on the order of
updates to the replicated state. This ensures that all processes have a
consistent view of the system state and reduces the risk of conflicts or
data loss.
Majority decisions: In a consensus protocol-based
system, majority decisions are used to ensure that the system can make
progress even in the presence of failures. This means that a majority of
processes must agree on the order of updates to the replicated state
before they can be applied.
Group membership: In order to ensure that the DWAL
can function properly, it is important to have a mechanism in place for
maintaining an up-to-date view of the membership of the group of
processes.
Message order and latency hiding: To ensure that
the DWAL can function effectively, it is important to ensure that
updates are delivered to all processes in a consistent order.
1.9.9 Replication
1.9.9.1 Properties of Replication
Models
Who is responsible for updates: Single master or
multiple masters?
What is being updated: State transfer or operation
transfer?
How updates are ordered
Conflict detection and resolution
Method for updating replica nodes
Guarantees for divergence
1.9.9.2 Single-Leader
Replication
Steps:
Client sends x=5 to Node1 (master)
Master updates node 2 and node 3 (followers)
Client receives changed value (or old value; due to replication
lag)
Advantages:
Ordered updates
Efficient caching
Highly available reads
Disadvantages:
Replicas may be out of sync with the master
Leader crash may cause problems
Followers may take a while to take over in case of
leader failure
Not suitable for critical resources such as primary
keys
1.9.9.3 Eventually Consistent
Reads
Eventual consistency model: Allows for a certain level of
lag between updates to be propagated to all replicas
Steps:
Client updates value on Master-Replica node
Master-Replica eventually propagates update to Slave replica
Client performs a stale read from the slave node, potentially
returning an outdated value
1.9.9.4 Multi-Master
Replication
Multi-Master Replication (MMR) is a type of replication in which
multiple servers can accept write requests, allowing
any server to act as a master. This means that updates can be made to
any server, and the changes will be replicated to all other servers in
the network. MMR can be used to improve the availability and
scalability of a system, as it allows updates to be made to any
server and allows multiple servers to handle write requests.
It also introduces the possibility of conflicts, as
multiple servers may receive updates to the same data simultaneously. To
resolve these conflicts, MMR systems typically use conflict resolution
algorithms (last writer wins, keeping different versions, anti-entropy
background merge/resolve) or allow the user to manually resolve
conflicts.
Steps:
Client 1 writes x=5 to Node 1 (master)
Client 2 writes x=10 to Node 2 (master)
The masters detect a conflict
Conflict types:
Update conflict: Two identical rows changed on two
servers
Uniqueness conflict: Two identical primary (unique)
keys added in the same table on two servers
Delete conflict: While a row is being deleted on one
server, the same row is changed on another server.
1.9.9.5 Leaderless Quorum
Replication
Write:
In a leader-less (quorum) replication system, the client
decides how many machines to write to or read from using the
formula W+R>N, where W is the number of replicas written, R the number
of replicas read, and N the number of machines in the replication
group.
Without a designated leader, quorum systems may suffer from
long tail effects.
If a quorum is not available, the client can choose to write
to a “sloppy quorum” and risk the write being lost.
Without anti-entropy, there is a high risk of partial
writes in the system, which can lead to inconsistencies and can
be difficult to clean up.
Read:
Some systems may detect inconsistencies during a
read operation.
These systems can either automatically perform a
cleanup (e.g. using version numbers to return the correct
value) or offer both values for the client to choose
from.
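A sketch of the read path with version numbers, under the W+R>N assumption that at least one of the R replies carries the newest write:

    def read_repair(replies):
        # replies: list of (version, value) pairs from R replicas.
        version, value = max(replies)  # newest version wins
        stale = [i for i, r in enumerate(replies) if r[0] < version]
        return value, stale            # stale replicas get repaired

    # N=3, W=2, R=2 satisfies W+R>N, so reads overlap the last write.
    value, stale = read_repair([(2, "banana"), (1, "apple")])
    assert value == "banana" and stale == [1]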
1.9.9.6 Session Modes of
Asynchronous Replication
The following guarantees seem to enable “sequential
consistency” for a specific client, meaning that the program
order of updates from this client is respected by the system. Clients
can track these guarantees using vector clocks:
“Read your writes” (RYW) ensures that the contents
read from a replica include previous writes by the same user.
“Monotonic reads” (MR) ensures that successive
reads by the same user return increasingly up-to-date contents.
“Writes follow reads” (WFR) ensures that a write
operation is accepted only after writes observed by previous reads by
the same user are incorporated in the same replica.
“Monotonic writes” (MW) ensures that a write
operation is accepted only after all write operations made by the same
user are incorporated in the same replica.
We can also derive session anomalies from this:
Non-monotonic reads: Reads that do not return
increasingly up-to-date contents.
Non-monotonic writes: Write operations that are not
accepted in the order that they were made by the same user.
Non-monotonic transactions: Transactions that do
not preserve the order in which they were made by the same user.
Not-reading-my-writes: Reads that do not include
previous writes by the same user.
1.9.9.7 Global Modes of
Replication
There are multiple different modes to choose from:
Strong consistency ensures that all
previous writes are visible, and is characterized by the
following properties:
Ordered: Writes are accepted in the order that they were made.
Real: All writes are visible.
Monotonic: Write operations are accepted only after all previous
write operations made by the same user are incorporated in the same
replica.
Complete: All writes are included.
Consistent prefix ensures that an ordered
sequence of writes is visible, and is characterized by the
following properties:
Ordered: Writes are accepted in the order that they were made.
Real: All writes are visible.
X latest missing: Some of the latest writes may be missing.
Snapshot isolation-like: The system behaves like snapshot isolation,
where writes made by a transaction are not visible to other transactions
until the transaction is committed.
Bounded staleness ensures that all writes
older than X or every write except the last Y are visible, and
is characterized by the following properties:
Ordered: Writes are accepted in the order that they were made.
Real: All writes are visible.
X latest missing: Some of the latest writes may be missing.
Monotonic increasing due to bound: Write operations are accepted
only after all previous write operations made by the same user are
incorporated in the same replica, but the bound on staleness allows for
some deviation from strict monotonicity.
Eventual consistency ensures that a subset
of previous writes is visible, and is characterized by the
following properties:
Unordered: Writes may not be accepted in the order that they were
made.
Un-real: Some writes may not be visible.
Incomplete: Some writes may be missing.
Each of them has its own trade-offs:
Strong consistency:
Consistency: Excellent
Performance: Poor
Availability: Poor
Eventual consistency:
Consistency: Poor
Performance: Excellent
Availability: Excellent
Consistent prefix:
Consistency: Okay
Performance: Good
Availability: Excellent
Bounded staleness:
Consistency: Good
Performance: Okay
Availability: Poor
Monotonic reads:
Consistency: Okay
Performance: Good
Availability: Good
Read my writes:
Consistency: Okay
Performance: Okay
Availability: Okay
1.10 Distributed Services and
Algorithms I
1.10.1 Overview
Distributed Services
Replication, Availability and Fault-Tolerance
Global Server Load Balancing for PoPs
Typical Cluster Services
Fail-over and Load-Balancing
Directory Services
Cluster Scheduler (new)
Distributed Operating Systems
Example Services and Algorithms
Distributed File System with replication and map/reduce
Distributed (streaming) Log
Distributed Cache with consistent hashing
Distributed DB with sharding/partitioning functions
Distributed Messaging with event notification and gossip
1.10.2 Types of Distributed
Services
1.10.2.1 What is a Distributed
Service?
Function provided by a distributed middleware
with:
High scalability
High availability
1.10.2.2 Services and
Instances
Distributed systems:
Comprised of services, such as applications,
databases, caches, etc.
Services are made up of instances or nodes, which
are individually addressable hosts (physical or
virtual)
Key observation:
Unit of interaction is at the service level, not the
instance level
Concerned with logical groups of nodes, not specific
instances
Example: Interacting with a database server, rather than a specific
database instance.
1.10.2.3 Core Distributed
Services
Finding Things
Name Service
Registry
Search
Storing Things
Various databases
Data grids
Block storage, etc.
Events Handling and asynchronous processing:
Queues
Load Balancing and Failover
Caching Service
Locking Things and preventing concurrent access:
Lock service
Request scheduling and control: Request
multiplexing
Time handling
Providing atomic transactions: Consistency and
persistence
Replicating Things: NoSQL DBs
Object handling: Lifecycle services for creation,
destruction, relationship service, etc.
1.10.3 Availability
Continuous availability does not allow for planned
downtime.
1.10.3.1 Typical Hardware Causes
for Downtime
Overheating
PDU failure
Rack-move
Network rewiring
Rack failures
Racks go wonky
Network maintenances
Router reloads
Router failures
Individual machine failures
Hard drive failures
1.10.3.2 Availability through
Redundancy
Across groups of resources:
Multi-site data center
Disaster recovery
Scalability
Within a group of resources:
High availability (HA)
Clustering
Centralized administration (CA)
Automatic failover (CO)
Scalability
Data replication
Quorum algorithms (require multiple machines)
Between two resources:
High availability (HA)
Centralized administration (CA)
Automatic failover (CO)
Load distribution to prevent overload in case of failure
For an individual resource:
Single point of failure (SPOF)
Easy updates
Maintenance problems
Simple reliability
Limited vertical scalability
1.10.3.3 3-Copy Disaster Recovery
Solution
Maintains 3 copies of data/resources with at least 2 in different
locations
Enables quick switchover in case of disaster for business
continuity
Provides high availability and protects against data loss.
1.10.4 Load Balancing
1.10.4.1 Serial vs. Redundant
Availability
Serial chain of components:
Availability decreases with more members in the
chain
Individual components need higher availability
Redundant, parallel components:
The unavailabilities of the components are multiplied, and the
product is subtracted from 1 to determine overall availability
Only one component needs to be up to maintain
availability.
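A quick numeric sketch of both rules (the component availabilities are made-up example values):

    def serial(components):
        # A chain is up only if every member is up: availabilities multiply.
        a = 1.0
        for c in components:
            a *= c
        return a

    def parallel(components):
        # Redundancy is down only if *all* members are down: multiply the
        # unavailabilities and subtract the product from 1.
        u = 1.0
        for c in components:
            u *= 1 - c
        return 1 - u

    print(serial([0.99, 0.99]))    # ~0.9801: worse than either part
    print(parallel([0.99, 0.99]))  # 0.9999: better than either part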
1.10.4.2 Global Server Load
Balancing
DNS Round Robin:
Simple load balancing technique that distributes traffic to multiple
servers based on the client’s DNS query
Little mitigation in case of problems like overload, failure,
etc.
Clients may disregard TTL settings
Takes approximately 15 minutes to drain traffic from troubled
servers.
BGP Anycast:
Uses BGP routing to direct clients to the nearest available
server
BGP does not consider link latency, throughput, packet loss, etc. in
selecting the best route
With multiple routes to the destination, BGP simply selects the one
with the least number of hops
Troubleshooting can be demanding.
Geo-DNS:
Uses the client’s geographical location to determine the closest
server for traffic distribution
Relies on the accuracy of the DNS provider’s IP and location
guesswork
TTL setting may not accurately reflect the time-to-live of cached
information.
Real User Measurements (RUM):
Uses real-time data from end-user devices to dynamically adjust
traffic distribution for optimal performance.
1.10.4.3 Failover with Virtual
IPs
Failover with one virtual IP:
DNS points only to one Virtual IP (VIP)
In case of a server failure, client sessions are
lost but they can establish a new session on reconnect
No changes in DNS are required, avoiding the
potential issues of flushes and timeouts
Multi-site failover:
A combination of geo-aware DNS and a Load
Balancer/Fail-over front-server
Requests can be re-routed to different locations in
case of a server failure
May still have the limitations and issues associated with
geo-aware DNS.
1.10.4.4 Failover, Load Balancing
and Session State
Sticky Sessions: Keeps session state on a single
server, offers advantages with a non-replicated system of records but
limited in terms of fail-over and load-balancing options.
Session Storage in DB: Session state is stored in a
database, offers better scalability and fail-over options compared to
sticky sessions.
Session Storage in Distributed Cache: Session state
is stored in a distributed cache, which provides better performance and
scalability than database storage while still offering fail-over
options.
Today: Stateless servers with state in a DB are the
norm, but sticky sessions are still useful because they avoid having
to replicate session records.
Compromise: Replicate sessions between pairs of
servers, then enable switching between them as failovers
1.10.4.5 P2P Load Balancing
Evaluator Functions: Access server stats in shared
memory and determine the outcome of a request, whether it is handled by
its own server, redirected, or proxied.
Server Stats: Various metrics such as CPU usage,
number of requests, memory usage, etc., are replicated in shared memory
and used by evaluator functions to make load-balancing decisions.
Server Stat Replication: The replication of server
stats is done through multicast.
1.10.5 Service Organization
1.10.5.1 Requirements of
Distributed Name/Directory Systems
Functional Requirements:
Re-bind/resolve methods to store name/value
pairs
Query Interface
Name Aliases for multiple logical hierarchies
(DAG-structured name space)
Composite Names (path names)
Location Independence by separating address and
location of objects
Sound security system to ensure access to objects
based on names.
Non-Functional Requirements:
Definition: Non-Functional Requirements are
necessary functions of a system that ensure speed, reliability,
availability, security, etc. Many systems fulfill functional
requirements but fail to meet non-functional requirements.
Persistent Name/Value Combinations (retain mapping in case
of a crash)
Transacted Manipulation (all-or-nothing handling
during application installation)
Federated Naming Services to combine different
naming zones
Fault tolerance through replication for
availability
Fast lookup through clustering, with slower writes
and client-side caching support to reduce communication costs
1.10.5.2 Design of Distributed
Name/Directory Systems
Name Space:
The company name space organization is a design and
architecture concern.
System Management enforces rules, policies, and
controls changes.
Different machines host parts of the name space
(zones).
Naming Service:
Interfaces (naming service, factoryfinder, factory,
account) allow administrators to hide
versions/implementations from clients.
Finders in a remote environment support object migration and
copying.
1.10.5.3 Examples of Naming
Services
Domain Name System (DNS)
X.500 Directory
Lightweight Directory Access Protocol (LDAP)
CORBA Naming Service
Java Registry
J2EE JNDI (mapped to CORBA Naming Service)
1.10.5.4 Google vs. Amazon
Amazon:
Services
Storage Services:
S3 (huge storage capacity)
RDS
Aurora
Mongo
Computation Services: EC2 (Virtual Machines)
Queuing Services
Load-balancing services
Elastic map reduce
Cloudfront (fast cache)
Lambda
Stepfunctions
AI framework services
Google:
Products
Scheduling Service
Map/Reduce Execution Environment
F1 NewSQL DB
Spanner Replicated DB
BigTable NoSQL Storage
Distributed File System
Chubby Lock Service
Graph Processing and SQL:
Pregel
Percolator
Dremel
Dapper (distributed tracing)
Google Compute Platform
1.10.5.5 Best Practices for
Designing Services
Keep services independent
Measure services
Define SLAs and QoS for services
Allow agile development of services
Allow hundreds of services, aggregate them on
special servers
Avoid middleware and frameworks that force
patterns
1.10.6.1 Command Query Responsibility Segregation (CQRS)
Separates the responsibilities of reading data
(queries) and modifying data (commands) into separate
objects or services.
Improves scalability and performance by allowing
reads and writes to be optimized separately.
Promotes event-driven architecture by allowing
commands to trigger domain events.
Simplifies domain modeling by reducing the complexity of
aggregates.
Increases consistency by using separate models for
reads and writes.
Reduces the coupling between the read and write
sides of the system.
1.10.6.2 Requirements of Caching
Services
Scalable with ability to add machines
Avoid “thundering herds” due to placement
changes
Supports replication of cache entries
High performance required
Optional disk backup support
Supports various storage mediums, from RAM to
SSD
Supports different cache replacement policies with
caution.
1.10.6.3 Handling Changing Machine
Counts in Caching Services
Problem: Changing Machine Count
Solution: Consistent Hashing (Ring)
Machines are mapped into a ring and their
position determines the key-space they are responsible
for.
Machines can be assigned multiple virtual positions.
1.10.6.4 Consistent Hashing
Algorithms
Simple Consistent Hashing Algorithm:
URLs and caches are mapped to points on a circle
using a standard hash function.
URL assigned to closest cache in clockwise
direction.
Adding a new cache only reassigns the closest URLs to it,
items don’t move between existing caches.
Dynamo Consistent Hashing Algorithm:
Separates placement and partitioning.
Uses virtual nodes assigned to real machines for
more flexibility.
A real machine is responsible for multiple virtual nodes.
Improved load balancing due to additional
indirection.
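A minimal consistent-hash ring with virtual positions as a sketch (the hash function choice and node names are assumptions):

    import bisect
    import hashlib

    class HashRing:
        def __init__(self, nodes, vnodes=4):
            # Each machine gets several virtual positions on the ring.
            self.ring = sorted(
                (self._hash(f"{n}#{i}"), n)
                for n in nodes for i in range(vnodes)
            )
            self.points = [h for h, _ in self.ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def lookup(self, key):
            # Clockwise walk: first position >= hash(key), wrapping around.
            i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
            return self.ring[i][1]

    ring = HashRing(["cache-a", "cache-b", "cache-c"])
    print(ring.lookup("http://example.com/page"))

Adding a machine only claims the keys between its new positions and their predecessors; all other keys stay put.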
1.10.6.5 Cache Patterns
Pull:
Occurs during request time
Concurrent misses and client crashes can result in outdated
caches
Complicated handling of concurrent misses and
updates
Can be slow and dangerous for backends
Push:
Automated push updates cached values
Should only be used for values that are
always needed
Pre-warmed:
The system loads the cache before the application
starts serving clients
Used for big applications with pull caches to avoid
boot issues
General consideration: Be aware of LRU or clocked
invalidations as cache is mission critical.
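The pull pattern in its simplest form, assuming hypothetical cache.get/set and db.load accessors:

    def cached_get(cache, db, key):
        value = cache.get(key)
        if value is None:          # cache miss
            value = db.load(key)   # potentially slow backend call
            cache.set(key, value)  # populate for later requests
        return value

Note that concurrent misses on the same key still stampede the backend unless per-key locking is added, which is the danger described above.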
1.10.6.6 Cache Design
Considerations
Kinds of information fragments
Lifecycle of fragments
Validity of fragments
Effects of fragment invalidation
Dependencies between fragments, pages, etc.
1.10.7 Events
1.10.7.1 Local vs. Distributed
Events
Local:
Observer updates sent on one thread
If observer doesn’t return, mechanism stops
If observer calls back to observed during update call, deadlock can
occur
Solution doesn’t scale and is not reliable (e.g. observer crashes
result in lost registrations)
Does not work for remote communication
Distributed:
Various combinations of push and pull models
possible
Receivers can install filters using a constraint
language to filter content (reduces unwanted notifications)
1.10.7.2 Asynchronous
Event-Processing
Publisher and subscriber communicate through interaction
middleware
Used to decouple components and asynchronous
sub-requests from synchronous main requests (so that multiple fast tasks
can run parallel to a slow main task)
Implemented as Message-Oriented-Middleware (MOM) or
socket-based communication library
Can be implemented in broker-less or brokered
mode.
1.10.7.3 Features of Event-Driven
Interaction
Basic Event: Any entity can send and receive events
without restrictions or filtering.
Subscription: A receiver can subscribe to specific
events, making event delivery more efficient.
Advertisement: The sender informs receivers about
possible events, reducing the need for broadcasting.
Content-Based Filtering: The sender, middleware, or
receiver can apply filtering based on event content.
Scoping: Administrative components can manipulate
event routes, enabling invisible communication between components.
1.10.7.4 Types of Message Oriented
Middleware (MOMs)
Centralized Message-Oriented-Middleware:
Collects all notifications and subscriptions in one central place,
enabling easy event matching and filtering
Has a high degree of control and no
security/reliability issues on clients
Can create scalability and single-point-of-failure
problems
Clustered Message-Oriented-Middleware:
Provides scalability at higher
communication costs
Has lots of routing/filter-tables at cluster nodes, making
filtering and routing of notifications expensive
Simple P2P Event Libraries:
Local libraries are aware of each other, but components are
de-coupled
Broker-less architecture is faster than brokered
ones
Offer no delivery guarantees beyond at-most-once and no
protection against message loss
Guarantees atomicity and possibly
FIFO
Examples include ZeroMQ, Aeron, and Nanomsg
Flooding Protocols:
Notifications travel towards subscriptions, which
are only kept at leaf brokers
Advantages include that subscriptions become effective
quickly, and notifications are guaranteed to arrive
everywhere
Price is many unnecessary notifications to leaf
nodes without subscribers
1.10.7.5 ZeroMQ
Brokerless socket library for messaging, with
message filtering
Connection patterns include pipeline, pub/sub, and
multi-worker
Various transports, including in process, across
local process, across machines, and multicast groups
Message-passing process model without the need for
synchronization
Multi-platform and multi-language support
“Suicidal snail” fail-fast mechanism to kill slow
subscribers
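A minimal pub/sub sketch using the pyzmq bindings; the endpoint and topic are made up, and publisher and subscriber would normally live in separate processes (freshly connected subscribers can miss the first messages, the “slow joiner” symptom):

    import time
    import zmq

    ctx = zmq.Context()

    pub = ctx.socket(zmq.PUB)            # publisher side
    pub.bind("tcp://127.0.0.1:5556")

    sub = ctx.socket(zmq.SUB)            # subscriber side
    sub.connect("tcp://127.0.0.1:5556")
    sub.setsockopt_string(zmq.SUBSCRIBE, "weather.")  # prefix filter

    time.sleep(0.5)                      # let the subscription propagate
    pub.send_string("weather.berlin 21C")
    print(sub.recv_string())             # -> weather.berlin 21C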
1.10.8 Sharding
1.10.8.1 Horizontal vs. Vertical
Sharding
Horizontal: Per (database) row, e.g. first 100 users
are in shard 1, 200 in shard 2 etc.
Vertical: Per (database) column, e.g. profile and
email is in shard 1, photos and messages in shard 2 etc.
1.10.8.2 Sharding Strategies
Allow adding heterogenous hardware in the future
Sharding should not make app code unstable
Sharding should be transparent to the app
Sharding and placement strategies should be separate
1.10.8.3 Horizontal Sharding
Functions
Algorithms applied to the key (often: user ID) to create different
groups (shards):
Numerical range: users 0-100000, 100001-200000,
etc.
Time range: 1970-80, 81-90, 91-2000, etc.
Hash and modulo calculation
Directory-based mapping using a meta-data table for
arbitrary mapping from key to shard
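The three function-based strategies as sketches (the shard counts and the meta-data table entries are illustrative assumptions):

    import hashlib

    def range_shard(user_id, size=100000):
        # Numerical range: 0-99999 -> shard 0, 100000-199999 -> shard 1, ...
        return user_id // size

    def hash_shard(key, shards=16):
        # Hash and modulo: spreads keys evenly, but changing `shards`
        # remaps almost every key (hence consistent hashing, see above).
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % shards

    directory = {"user:42": 3}  # meta-data table: key -> shard

    def directory_shard(key):
        # Directory-based: arbitrary mapping at the cost of a lookup.
        return directory[key]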
1.10.8.4 Consequences of
Sharding
No more SQL JOINs, leading to lots of
copied data
Increased need for partial requests for data
aggregation
Expensive distributed transactions required for
consistency (if needed)
Vertical sharding separates related data types belonging to one user,
while horizontal sharding separates related users from each other
(bad for social graph processing)
SQL limitations due to mostly key/value queries and
problems with automatic DB-Sequences
Every change requires corresponding application
changes
1.10.9 Relationships
1.10.9.1 Overview
Within the database, referential integrity rules
protect containment relationships
No equivalent in object space
No protection in distributed systems
For example when an employee leaves:
All rights are cancelled
Disc-space is archived and erased
Databases for authentication and application-specific DBs are
updated
Badge no longer works
All equipment has been returned
1.10.9.2 Functional Requirements of
Relationship Services
Definition of relations between objects without
modifying those objects
Support for different types of relations
Ability to create graphs of relations
Ability to traverse relationship graphs
Support for reference and containment
relations
1.10.9.3 Why Relationship Services
Failed
The good:
Powerful modeling tool
Helps with creation, migration, copy, and deletion of composite
objects
Maintains referential integrity
The bad:
Tends to create many small server objects
Performance impact
Not supported by many CORBA vendors for a long time
EJB only supported with local objects in the same container.
Coordination and distributed transactions slow down the process and
affect availability.
The cost of knowing the truth is high for many applications.
The truth might only be a partial or outdated version.
Availability is prioritized over consistency by making local
decisions with available information.
Improves the user experience by making this trade-off, most of the
time.
1.11.2.2 Aspects of Classic
Distributed Consistency
Distributed Objects and Persistence: Objects that
span across multiple systems and persist data in multiple
locations.
ACID: Atomicity, Consistency, Isolation, Durability
- a set of properties that guarantee that database transactions are
processed reliably.
Transactions: A sequence of database operations
that are executed as a single unit of work.
Isolation Levels: The level of isolation between
concurrent transactions, specifying how one transaction affects
another.
Two-Phase Locking: A protocol for enforcing
serializable access to shared resources in a distributed system.
Distributed Transactions: Transactions that span
multiple systems and persist data in multiple locations.
Two-Phase Commit (2PC): A protocol for ensuring
that a transaction is committed in a consistent state in a distributed
system.
Failure Models for 2PC: Models for how 2PC protocol
handles system failures and ensures the consistency of
transactions.
1.11.2.3 Persistent Object
Representations
Real storage object lives in a data store and uses
data store concepts for storage (e.g. a row in a table).
Service works with object representations
(“Incarnations”) and provides the illusion of a
persistent object to clients.
Java Connector Architecture provides an adapter interface for
resource managers.
1.11.2.4 Mechanisms for
Persistence
SQL Driver:
Used to store object state.
Suffers from “impedance mismatch”.
Needs to control locking etc. in the service.
Object Relational Mapper (EJB/Hibernate):
Used to store object state transparently for the programmer.
Inheritance creates difficult problems for table mapping: Either
performance or flexibility suffer.
Just storing an object is simple - doing this in a way that protects
from concurrent access, system failures, and across different data
stores is much harder.
1.11.2.5 Persistent Object
Mapping
Enterprise integration software specializes in this kind of
mapping
Key to persistent mapping is meta-information:
Generates object representations for a
service.
Generates code necessary for the data store to
store the objects with its own mechanisms and objects.
1.11.2.6 Data Store Session
Pooling
Number of channels to a data store is
limited
If an object directly allocates a session (channel) and does not
return it quickly, system throughput would become marginal
Session creation is expensive (security!)
Clients can either ask a pool for a session or the container
framework can automatically allocate and return sessions
Problems:
Timeouts
Connection recycling
1.11.3 Locking
1.11.3.1 Locking Against Concurrent
Access
Binary locks:
Used to synchronize an object, causing all clients except one to be
blocked.
Limitations: Binary locks are simple to use, but their performance
suffers as they cannot distinguish between reads and writes.
Modal locks (read/write locks):
Used to allow clients who only want to read to obtain read locks.
Many concurrent read locks are possible.
Advantages: Modal locks allow for a more nuanced approach to
concurrent access, improving performance by allowing multiple read
operations to occur simultaneously.
Lock Granularity: The granularity of locks (the
scope of the resources being protected by the lock) affects the overall
throughput of a system. The smaller the lock granularity, the
better the performance will be.
1.11.3.2 Optimistic Locking
Process:
Lock a row, read it along with its timestamp, and then release the
lock.
Start a transaction
Write the data to the database.
Acquire locks for all data read and compare the data
timestamps.
If one of them is newer, the transaction is rolled back;
otherwise it is committed.
Advantages:
Better overall throughput as locks are held for only a short period
of time
Timestamp comparison logic is implemented as a framework mechanism
in the client session objects, simplifying the process
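A sketch of the validation step, using a hypothetical compare-and-set API in place of the lock/compare/commit sequence above:

    def optimistic_update(db, key, compute):
        # Read the row together with its version (no lock held afterwards).
        value, version = db.read(key)
        new_value = compute(value)
        # Commit only if nobody changed the row in the meantime.
        if db.compare_and_set(key, new_value, expected_version=version):
            return True    # committed
        return False       # a newer timestamp was found: roll back / retry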
1.11.3.3 Serializability with
Two-Phase Locking
Process:
Allocate all locks
Manipulate the data
Release all locks
Advantages: Requires that all locks be allocated
before any data manipulation and released only after the manipulation is
complete. Guarantees serializability.
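A sketch of the two phases with ordinary threading.Lock objects; acquiring in a global order also avoids deadlocks:

    def transfer(lock_a, lock_b, accounts, amount):
        # Growing phase: allocate *all* locks before touching data.
        for lock in sorted([lock_a, lock_b], key=id):
            lock.acquire()
        try:
            # Manipulate the data only while holding every lock.
            accounts["a"] -= amount
            accounts["b"] += amount
        finally:
            # Shrinking phase: release all locks; acquire none afterwards.
            lock_a.release()
            lock_b.release()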
1.11.3.4 Deadlocks
State where two or more processes are blocked because each
one is waiting for resources held by the other
Results in a situation where the processes cannot continue to run
and are stuck in a permanent waiting state
Can occur in concurrent systems where multiple
processes access shared resources
1.11.3.5 Distributed Deadlock
Detection
Local wait-for-graphs:
Correctness: Based on the definition of a
wait-for-graph, this method correctly detects deadlocks by identifying
cycles in the graph (a cycle-detection sketch follows at the end of
this subsection).
Liveness: This method can only detect deadlocks
that exist within a single process or machine, so it may miss deadlocks
in a distributed system.
Cost/complexity: The cost of implementing this
method is relatively low, as it only requires tracking locks and
resource requests within a single process.
Failure model: This method is susceptible to false
negatives (missed deadlocks) in a distributed system.
Architecture type: This method is suitable for
systems with a centralized architecture, where all locks and resource
requests can be monitored by a single process.
Detection servers:
Correctness: Detection servers are designed to
detect deadlocks in a distributed system, so this method should provide
correct results if implemented correctly.
Liveness: This method is designed to detect
deadlocks in a distributed system, so it should have better liveness
compared to local wait-for-graphs.
Cost/complexity: The cost of implementing this
method is higher than local wait-for-graphs, as it requires
communication and coordination between multiple processes.
Failure model: This method is susceptible to false
negatives if one or more detection servers fail, or if there are errors
in the communication between the servers.
Architecture type: This method is suitable for
systems with a decentralized architecture, where multiple processes are
involved in the detection of deadlocks.
Distributed edge chasing algorithms:
Correctness: This method is designed to detect
deadlocks in a distributed system, so it should provide correct results
if implemented correctly.
Liveness: This method is designed to detect
deadlocks in a distributed system, so it should have better liveness
compared to local wait-for-graphs.
Cost/complexity: The cost of implementing this
method is higher than local wait-for-graphs, as it requires
communication and coordination between multiple processes.
Failure model: This method is susceptible to false
negatives if there are errors in the communication between the
processes.
Architecture type: This method is suitable for
systems with a decentralized architecture, where multiple processes are
involved in the detection of deadlocks.
Stochastic detection:
Correctness: The accuracy of this method depends on
the parameters used, so it may provide incorrect results in some
cases.
Liveness: This method is designed to detect
deadlocks in a distributed system, so it should have better liveness
compared to local wait-for-graphs.
Cost/complexity: The cost of implementing this
method is relatively low, as it only requires monitoring resource
requests and using randomization to make decisions.
Failure model: This method may miss deadlocks if
the randomization parameters are not set correctly.
Architecture type: This method is suitable for
systems with a decentralized architecture, where multiple processes are
involved in the detection of deadlocks.
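The cycle check behind local wait-for-graphs as a sketch (the graph encoding is an assumption):

    def has_deadlock(wait_for):
        # wait_for maps a process to the processes it waits for.
        visited, on_path = set(), set()

        def visit(p):
            if p in on_path:
                return True        # back edge: cycle, hence deadlock
            if p in visited:
                return False
            visited.add(p)
            on_path.add(p)
            if any(visit(q) for q in wait_for.get(p, ())):
                return True
            on_path.remove(p)
            return False

        return any(visit(p) for p in wait_for)

    # P1 waits for P2 and P2 waits for P1: a classic deadlock.
    assert has_deadlock({"P1": ["P2"], "P2": ["P1"]})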
1.11.4 Transactions
1.11.4.1 Classic ACID
Definitions
Durability: Ensures that once a transaction is
committed, its effects persist even in the case of system
failures (e.g. a crash that causes you to lose changes made to
a word file)
Atomicity: Ensures that a transaction is treated as
a single, indivisible unit of work that either happens in its
entirety or doesn’t happen at all (e.g. in the case of a
birthday party re-schedule where not all participants were caught in
time)
Isolation: Ensures that the concurrent execution of
transactions results in a system state that would be obtained as
if transactions were executed serially (e.g. if two people work
on a shared file, their changes should not interfere with each
other)
Consistency: Ensures that the system
remains in a valid state after a transaction is executed
(e.g. after you complete a friend’s work for the day, the tasks remain
consistent, and the system remains in a valid state)
1.11.4.2 Transaction Properties and
Mechanisms
Atomic Changes over Distributed Resources: This is
achieved through the use of consensus or voting algorithms such as
two-phase commit.
Consistency: This is maintained by observing
consistency constraints between objects, such that the system remains in
a valid state before and after a transaction is executed.
Isolation from Concurrent Access: This is
accomplished through the use of locking mechanisms, such as two-phase
locking or hierarchical locking.
Durability of Changes: This is ensured by
transferring changes made to memory objects to persistent storage, to
prevent loss in case of a system failure.
1.11.4.3 Serializability and
Isolation
Definition: States that the outcome of executing a
set of transactions should be equivalent to some serial execution of
those transactions.
Purpose: The purpose of serializability is to ensure
that each transaction operates on the database as if it were running by
itself, which maintains the consistency and correctness of the
database.
Importance: Without serializability, ACID
consistency is generally not guaranteed, making it a crucial component
in maintaining the integrity of the database.
1.11.4.4 Transaction API
System starts in a consistent state
Begin transaction
Modify objects
Commit transaction:
System has a new, consistent state
Local copies of objects are now invalid
Changes become visible to others
On error: Rollback:
Triggered either by the system or by the client
Only a successful commit makes the new state durable and visible to
others
A rollback returns the system completely to its initial state
Afterwards, the client cannot even tell that an operation was attempted
Log files written during the transaction have to be cleaned up.
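To make this life cycle concrete, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and its values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()  # system starts in a consistent state

try:
    # Modify objects: sqlite3 begins a transaction implicitly on the first write.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 'bob'")
    conn.commit()    # new consistent state, durable and visible to others
except sqlite3.Error:
    conn.rollback()  # back to the initial state, as if nothing was attempted
```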
1.11.4.5 Components of Distributed
Transactions
Process
Begin()
Commit()
Rollback()
RPCs:
Register (transactional servers)
Vote (objects)
Commit/rollback (objects, resource managers)
Read/write/prepare (resource managers)
Components:
Transaction
Transactional client
Transactional servers (objects)
TACoordinator
XA resource managers
1.11.4.6 Service Context
Some services require context information to flow
with a call
Security: Needs to flow user information, access
rights, etc.
Transactions: Needs to flow information about
ongoing transactions to participants
The additional information needs to be standardized
to allow different vendor implementations of services to
interoperate.
1.11.4.7 Distributed Two-Phase
Commit
Vote:
To achieve atomic operations in a distributed setting, the
TA-Coordinator asks all participants for their vote on
committing or rolling back.
Upon receiving a commit() call from a client, the objects
that are part of the TA vote by asking their resource managers
(e.g. databases) to prepare for the commit.
A successful return of “prepare” from resource managers means that
both the object and the resource manager have promised to
commit the changes if the coordinator sends a commit.
Completion:
The coordinator is the only entity that can commit or
abort a TA after the prepare phase.
If the vote phase was successful and all participants have prepared
for a commit, the coordinator calls for a commit.
In case of an error (e.g. unreachable participant), the
coordinator calls for a rollback.
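A minimal, deliberately non-fault-tolerant sketch of the two-phase-commit control flow; all class and method names here are illustrative, not a real API:

```python
class Participant:
    """Hypothetical object/resource manager taking part in the TA."""
    def prepare(self) -> bool:
        # Persist enough state to be able to commit later, then promise.
        return True   # vote "commit"; a real participant may vote "abort"

    def commit(self):
        pass          # make the prepared changes final

    def rollback(self):
        pass          # discard the prepared changes


class Coordinator:
    def __init__(self, participants):
        self.participants = participants

    def run(self) -> bool:
        # Phase 1 (vote): ask every participant to prepare.
        try:
            votes = [p.prepare() for p in self.participants]
        except Exception:      # an unreachable participant counts as "abort"
            votes = [False]
        # Phase 2 (completion): only the coordinator decides the outcome.
        if all(votes):
            for p in self.participants:
                p.commit()
            return True
        for p in self.participants:
            p.rollback()
        return False
```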
1.11.4.8 Failure Models of
Distributed Transactions
Work Phase:
If a participant crashes or becomes unavailable,
the coordinator calls for a rollback.
If the client crashes before calling commit, the
coordinator will timeout the TA and call for a
rollback.
Voting Phase:
If a resource becomes unavailable or has other
issues, the coordinator calls for a rollback.
Commit Phase (Server Uncertainty):
In case of a crashed server, it will consult the
coordinator after restart and ask for the decision
(commit or rollback).
1.11.4.9 Special Problems of
Distributed Transactions
Resources:
Participants in distributed TAs consume many system
resources because all actions are logged to temporary persistent
storage.
Large parts of the system may become locked during
a TA.
Coordinator as a Single Point of Failure:
The coordinator must also prepare for a crash and log all actions to
temporary persistent storage.
Heuristic Outcomes for Transactions:
In certain circumstances (e.g. when the coordinator is unreachable
during the in-doubt phase), the outcome of a transaction can only be
decided heuristically because the real outcome cannot be determined.
1.11.4.10 Transaction Types
Flat Transactions:
Characterized by all-or-nothing behavior.
Any failure causes complete rollback to original state.
Can result in loss of significant amount of work if many objects
have been handled.
Nested Transactions:
Allow partial rollbacks with a parent transaction.
Child TA rollback doesn’t affect parent TA, but parent TA rollback
returns all participants to initial state.
Example: Allocation of a travel plan (hotel, flight, rental-car,
trips, etc.).
Long-running Transactions:
Challenge is resource allocation and increasing amount of work lost
in case of rollback.
Syncpoints move the fallback position closer to the commit
point.
Compensating Transactions:
Improves transaction throughput by making objects visible sooner, at
the cost of sacrificing the ISOLATION property.
Require compensation for a previous TA which can no longer be rolled
back.
Whether compensating transactions are possible depends on the
application.
Can be hand-coded if no transaction monitor/manager is available (see
the sketch below).
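A minimal hand-coded compensation sketch along these lines; the booking and cancellation functions in the commented usage are hypothetical:

```python
# Saga-style, hand-coded compensation: each step pairs an action with
# an undo action that semantically reverses it.
def run_with_compensation(steps):
    """steps: list of (action, compensation) pairs."""
    done = []
    try:
        for action, compensation in steps:
            action()
            done.append(compensation)
    except Exception:
        # Completed steps are already visible (isolation was sacrificed),
        # so they cannot be rolled back; compensate in reverse order.
        for compensation in reversed(done):
            compensation()
        raise

# run_with_compensation([(book_hotel, cancel_hotel),
#                        (book_flight, cancel_flight)])
```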
1.11.4.11 ANSI Transaction
Levels
Problems:
Dirty reads: Occurs when a transaction reads data
written by another concurrent transaction that has not yet been
committed.
Non-repeatable reads: Occurs when a transaction
re-reads data it has previously read and finds that the data has been
modified by another transaction that has since committed.
Phantom reads: Occurs when a transaction
re-executes a query returning a set of rows that satisfies a search
condition and finds that the set of rows satisfying the condition has
changed due to another recently-committed transaction.
Transaction Levels:
Read Uncommitted:
Prevents: Nothing
Read Committed:
Prevents:
Dirty reads.
Repeatable Read:
Prevents:
Dirty reads
Non-repeatable reads
Serializable:
Prevents:
Dirty reads
Non-repeatable reads
Phantom reads
The higher the level, the more overhead is required.
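As an illustration, this is how a client might request a level per transaction, assuming a PostgreSQL-compatible database and the psycopg2 driver; the connection string and table are hypothetical:

```python
import psycopg2  # assumes a PostgreSQL-compatible database and driver

conn = psycopg2.connect("dbname=shop")  # hypothetical connection string
with conn:                              # commits on success, rolls back on error
    with conn.cursor() as cur:
        # Must be the first statement of the transaction; the higher the
        # level, the more locking/validation overhead it causes.
        cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
        cur.execute("SELECT count(*) FROM orders")  # hypothetical table
        print(cur.fetchone())
```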
1.11.5 Distributed Filesystems
1.11.5.1 Filesystem Block Order
Guarantees
BOB (the Block Order Breaker) is a tool for evaluating the
behavior of modern file systems with regard to building
crash-consistent applications.
It tests which block-ordering guarantees a file
system actually provides.
1.11.5.2 Eventually Consistent
Storage Systems
Consistency without Coordination: The design of
modern data management systems has been impacted by the rise of
Internet-scale geo-replicated services.
Weak Alternatives: To reduce the cost of expensive
coordination, many systems have sought weaker alternatives that still
ensure application integrity.
Cost of Coordination: Classic mechanisms like
serializable transactions come with associated availability, latency,
and throughput penalties.
When to Forego Coordination: The question of when
it’s safe to forego the cost of expensive coordination versus when it’s
necessary to pay the price is an important one.
1.11.5.3 Grid Storage
vs. NAS/SAN
Grid Storage:
Requires POSIX-Grid gateway
Special caching may be needed for video
(readahead)
Offers huge bandwidth and scalability
May require special maintenance
May be proprietary
Supports parallel processing
Requires special applications
Compatibility with existing apps is
questionable
Disaster recovery across sites is
unclear
Requires more electric power and space
NAS/SAN:
POSIX-compatible
Special caching is difficult to implement
Has a hard limit in SPxx storage interface
Offers simple upgrades
Supports standard file system
Allows dynamic growth of file systems via LUN
organization
Requires maintenance effort to balance
space/use
Proven and fast technology
Expensive disaster recovery through smaller
replicas
Several different file system
configurations are possible
Without virtual SAN, hot-spots can occur on a
single drive
Longer drive-rebuild times
To summarize:
Watch out for proprietary lock-in and compatibility
issues with grid storage
Grid storage requires programming skills for
solutions (map/reduce with Hadoop)
NAS/SAN is a proven and faster technology
NAS/SAN won’t be replaced by grid storage (which is
specialized)
1.11.6 NoSQL
1.11.6.1 Forces behind NoSQL
Need for low-latency and high-throughput access to data
Difficulty in managing and maintaining consistency in a distributed
system
Increased focus on scalability and flexibility
Changing data requirements and needs for real-time processing
Cost and complexity of traditional RDBMSs in large-scale systems
Inability of RDBMSs to handle large amounts of unstructured data
The need for horizontal scaling in storage
Lack of support for real-time, complex data processing in
RDBMSs
The need for automatic scaling in storage to keep up with rapidly
growing data
Relaxed data consistency requirements in some applications.
1.11.6.2 Scaling Problems of
RDBMSs
Poor time complexity of SQL joins: at least O(m+n)
for a hash or merge join, up to O(m·n) for nested loops
Difficulty in horizontally scaling, resulting in
loss of joins or jumping between nodes
Unbounded nature of queries, which can lead to a
single query overloading a database
Optimized for storage efficiency (no duplicates),
integrity, and flexibility of access through arbitrary joins.
1.11.6.3 NoSQL Design Patterns
Use partition keys with many distinct values for
better scalability and data distribution.
Opt for a single table design with hierarchical
modeling and de-normalization to simplify the data
structure.
Ensure that values are evenly requested to avoid hot spots.
Utilize composite secondary keys for 1:n and n:n
queries.
Limit query responses with paging token for better performance.
Consider the use case and access patterns before finalizing the data
layout.
Avoid relational modeling and instead focus on
simplifying the data structure.
Data integrity is an application concern and
should be handled by the application logic.
Data storage efficiency is not a primary concern.
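A sketch of several of these patterns (partition keys with many distinct values, composite sort keys, paging) using DynamoDB via boto3; the table name and key layout are hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical single-table design: partition key "pk" with many
# distinct values, composite sort key "sk" for 1:n queries.
table = boto3.resource("dynamodb").Table("AppTable")

table.put_item(Item={"pk": "CUSTOMER#42",
                     "sk": "ORDER#2023-01-07#981",
                     "total": 1999})

# All orders of one customer, limited per request for paging.
resp = table.query(
    KeyConditionExpression=Key("pk").eq("CUSTOMER#42")
                           & Key("sk").begins_with("ORDER#"),
    Limit=25,  # page size; resume via LastEvaluatedKey as the paging token
)
print(resp["Items"])
```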
1.11.6.4 DynamoDB Design
Principles
Decentralized design with no single point of failure (no master
node)
Supports heterogeneous hardware
Symmetric peers for better scalability
Incrementally scalable to handle increasing load
Eventually consistent data replication
Requires a trusted environment for data security
Replication support for higher data availability
Always writeable, with conflict resolution deferred to reads
Multi-version store with conflict-resolution policies for better
data management
1.11.7 Beyond Relaxed
Consistency
1.11.7.1 Overview
Order-insensitive processing using CALM
(Consistency as Logical Monotonicity) principles in EC (Eventual
Consistency) programs
Converging replicated data types (CRDTs) divided
into two types:
State-based CRDTs
Operation-based CRDTs
1.11.7.2 CALM Principle
“Consistency as Logical Monotonicity”
Links consistency with logical monotonicity, where
monotonic programs ensure eventual consistency regardless of the order
of delivery and computation.
Monotonic programs do not require coordination,
unlike non-monotonic programs where adding an element to the input set
can revoke a previously valid output.
Non-monotonic programs require coordination schemes
that wait until inputs are complete before proceeding.
1.11.7.3 CALM Operations
Logically Monotonic:
Initializing variables
Accumulating set members
Testing a threshold condition
Non-monotonic:
Overwriting variables
Set deletion
Resetting counter
Negation
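A small sketch contrasting the two classes; the event-handling functions are hypothetical:

```python
# Monotonic: the output only grows as input arrives, so any delivery
# order yields the same final result and no coordination is needed.
seen = set()

def observe(event):
    seen.add(event)        # accumulating set members

def alarm():
    return len(seen) >= 3  # threshold test: once true, it stays true

# Non-monotonic: overwriting makes the result depend on arrival order,
# so replicas must coordinate (or tolerate divergence).
latest = None

def overwrite(value):
    global latest
    latest = value         # last delivery wins
```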
1.11.7.4 CRDTs
State-based CRDTs:
Calculate the new result at one node and then propagate it to
replicas.
The merge operation must be commutative, associative, and idempotent;
sets with union as the merge are the canonical example.
Operation-based CRDTs:
Send the requested operation to each replica and calculate the
results locally.
The operations must be commutative and must be delivered exactly once
and in FIFO order.
These delivery guarantees are difficult to achieve, making
state-based CRDTs more popular currently.
1.11.7.5 Bending the Problem
Separates data store and application-level
consistency concerns.
CALM, ACID 2.0, and CRDT appeal to higher-level consistency criteria
in the form of application-level invariants.
Instead of requiring strong consistency for every read and write,
the application only needs to ensure semantic
guarantees (e.g., “the counter is strictly increasing”).
This grants more flexibility in how reads and writes are
processed.
1.11.7.6 Examples of CRDTs
Counters:
Grow-only counter: Payload is one integer per replica; merge is the
element-wise max, the value is the sum of all entries
Positive-negative counter: Consists of two grow counters, one for
increments and another for decrements
Registers:
Last Write Wins register: Uses timestamps or version numbers, merge
operation is max(ts), payload is a blob
Multi-valued register: Uses vector clocks, merge operation takes
both values
Sets:
Grow-only set: Merge operation is union(items), payload is a set, no
removal is allowed
Two-phase set: Consists of two sets, one for adding and another for
removing, elements can be added once and removed once
Unique set: Optimized version of the two-phase set
Last write wins set: Merge operation is max(ts), payload is a
set
Positive-negative set: Consists of one PN-counter per set item
Observed-remove set
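As a minimal illustration, a state-based grow-only counter sketched in Python; the node IDs are hypothetical:

```python
# State-based grow-only counter (G-Counter): one slot per node,
# merge = element-wise max, value = sum of all slots.
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.slots = {}  # node_id -> increments performed at that node

    def increment(self):
        self.slots[self.node_id] = self.slots.get(self.node_id, 0) + 1

    def value(self):
        return sum(self.slots.values())

    def merge(self, other):
        # Commutative, associative, and idempotent, so replicas
        # converge regardless of gossip order or duplicated merges.
        for node, count in other.slots.items():
            self.slots[node] = max(self.slots.get(node, 0), count)

a, b = GCounter("A"), GCounter("B")
a.increment(); b.increment(); b.increment()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3
```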
1.11.7.7 Distributed
Coordination
Features:
Configuration changes and notifications
Updates for failed machines
Dynamic integration and deconfiguration of new machines
Elastic configuration with partial failures
API for watches, callbacks, automatic file removal, and
triggers
Simple data model (directory tree model)
High performance and highly available in-memory cluster
solution
No locks for updates, but total ordering of requests for all cluster
replicas
All replicas answer reads
Wait-free implementation of coordination service with client API
performing locks, leader selection, etc
Liveness and Correctness:
Sequential Consistency: Updates from a client will
be applied in the order they were sent.
Atomicity: Updates either succeed or fail. No
partial results.
Single System Image: A client will see the same
view of the service regardless of the server it connects to.
Reliability: Once an update has been applied, it
will persist from that time forward until a client overwrites the
update.
Timeliness: The client’s view of the system is
guaranteed to be up-to-date within a certain time bound.
1.11.8 Distributed Coordination
Services
1.11.8.1 Zookeeper API
create: Creates a node at a specified location in
the tree
delete: Deletes a node from the tree
exists: Tests if a node exists at a specified
location in the tree
get data: Retrieves the data stored at a node
set data: Writes data to a node
get children: Retrieves a list of children of a
node
sync: Waits until the client’s view has caught up with
all updates pending at the start of the call.
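A sketch of this API using the kazoo client library for ZooKeeper; the host address and paths are hypothetical:

```python
from kazoo.client import KazooClient  # third-party ZooKeeper client

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.ensure_path("/app/config")                # create parents if needed
zk.create("/app/config/feature", b"on")      # create
print(zk.exists("/app/config/feature"))      # exists
data, stat = zk.get("/app/config/feature")   # get data
zk.set("/app/config/feature", b"off")        # set data
print(zk.get_children("/app/config"))        # get children

# Watch: the callback fires immediately and again on every change.
@zk.DataWatch("/app/config/feature")
def on_change(data, stat):
    print("feature flag is now", data)

zk.stop()
```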
1.11.8.2 Primary-Order Atomic
Broadcast with Zab
Primary sends non-commutative, incremental state
changes to backup units
Order of incremental changes maintained
even in case of primary crash
Multiple outstanding requests possible
Identification scheme to prevent re-ordering of
updates
Synchronization phase to ensure old updates delivered before
new ones stored.
1.11.8.3 Consistency Requirements
for ABCast (Reliable Ordered Atomic Broadcast)
Validity: If a correct process broadcasts a
message, all correct processes will eventually deliver it.
Uniform Agreement: If a process delivers a message,
all correct processes will eventually deliver it.
Uniform Integrity: Every process delivers a message
at most once, only if it was previously broadcast by sender.
Uniform Total Order: If processes p and q both
deliver messages m and m′, they deliver them in the same order.
1.11.8.4 Primary Order
Local primary order: If a primary broadcasts (v, z)
before (v′, z′), then a process that delivers (v′, z′) must have
delivered (v, z) first.
Global primary order: If primary Pi broadcasts (v, z) and a
later primary Pj > Pi broadcasts (v′, z′), then a process that
delivers both (v, z) and (v′, z′) must deliver (v, z) first.
Primary integrity: If primary Pe broadcasts (v, z) and some
process delivers (v′, z′) broadcast by an earlier primary Pe′ < Pe,
then Pe must have delivered (v′, z′) before broadcasting (v, z).
1.11.8.5 HA Transactions
Provide transactional guarantees without becoming
unavailable during network partitions or periods of high network
latency (every non-failing replica must respond)
Not CAP-consistent: Cannot provide linearizability,
i.e. reading the most recent write from any replica
Not HAT-compliant: Serializability, Snapshot
Isolation, Repeatable Read isolation
Achievable with algorithms relying on multi-versioning and
client-side caching: Read Committed isolation, transactional
atomicity, etc.
Causal consistency with phantom prevention and ANSI
Repeatable Read require affinity with at least one server (sticky
sessions)
Cannot prevent concurrent updates to shared data
items and cannot provide recency guarantees for reads.
1.12 Design of Distributed
Systems
1.12.1 Overview
Key principles for system design
Utilizing caching and replication for efficient operations
Importance of architecture in optimizing performance
Validation of architectural design
Techniques for improving performance in large fan-out
architecture
Strategies for achieving fault-tolerance in high-scale systems.
1.12.2 Design Principles for
Distributed Systems
1.12.2.1 Overview
Consideration of Latency: Examination of buffering
and round-trip times
Importance of Locality: Proper placement of heavily
interacting components
Avoiding Duplication of Work: Utilizing resources
effectively
Resource Pooling: Reusing resources in
communication such as connections or thread pools
Parallelization: Design for concurrent operations
and minimize serialization
Evaluating Consistency: Determining the appropriate
level of consistency with caching and replication
Caching and Replication Strategies: Utilizing
prediction and bandwidth to reduce latency
End-to-end Argument: Minimizing heavy guarantees at
lower levels of the system.
Know Your No. 1 Enemy: Latency!
1.12.2.2 Sharing Resources and
Data
Pooling resources can improve performance even in
local systems
High-frequency requests can lead to memory allocation issues and
poor performance
Caching is crucial for the effectiveness of
distributed applications
Minimizing backend requests while maintaining sane
application logic
Breaking down information into smaller fragments
can reveal reusable parts
1.12.2.3 Connection Pooling
Matching server and database CPU
capabilities
Avoiding blocking and app threads holding onto
connections
Careful monitoring of wait time in the pool
Checking I/O rates with new hardware
Understanding what constitutes a “connection” to
storage
Monitoring core/thread ratio, etc.
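A minimal fixed-size pool sketch illustrating why wait time in the pool is the metric to watch; connect() is a hypothetical factory function:

```python
import queue

class ConnectionPool:
    """Fixed-size pool; time spent waiting in acquire() is worth monitoring."""
    def __init__(self, size, connect):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())  # connect() is a hypothetical factory

    def acquire(self, timeout=1.0):
        # Blocks until a connection is free; raises queue.Empty on timeout,
        # a direct signal that the pool (or the database) is too small.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)           # return promptly; don't hold onto it
```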
1.12.2.4 Horizontal
Scaling/Parallelization
Horizontal scaling through parallel processing
Every request can be handled by any thread on any
host
Avoid synchronization points in servlet engines or
database connections.
1.12.2.5 Caching and
Replication
Caching components are responsible for maintaining data
validity
Data source is responsible for keeping replicas consistent and
up-to-date
Focus on reducing back-end requests for improved efficiency.
1.12.2.6 End-to-End Argument
User/Developer: Compensation for behavior through
application
Application Layer: Use of special commands, such as
“Select for Update” or “Begin Transaction”
Intermediate Layer: Compiler/Languages utilizing
technologies such as Software Transactional Memory and memory
models
Base Layer: Considerations for CPU cache coherence,
database isolation levels, and real-time streaming, etc.
1.12.2.7 Design Methodology
Back-of-the-envelope calculations
Decide on geographical distribution and replication
strategy
Determine data segregation, including single or
multi-tenancy models and partitioning
Divide business requirements into REST-like
services
Define SLAs for services, including availability,
latency, throughput, consistency, and durability
Define security context with IAAA (Identity,
Authentication, Authorization, Audit) and perform risk analysis
Complete monitoring and logging setup
Plan for deployment, release changes, testing, and
maintenance using fault-tolerant features.
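A back-of-the-envelope example with purely hypothetical numbers, just to show the style of calculation:

```python
# Back-of-the-envelope sizing; every number here is a made-up assumption.
daily_users       = 2_000_000
requests_per_user = 50
peak_factor       = 3            # peak traffic vs. the daily average

avg_rps  = daily_users * requests_per_user / 86_400  # ~1157 req/s
peak_rps = avg_rps * peak_factor                     # ~3472 req/s

per_server_rps = 400             # measured capacity of one app server
headroom       = 0.7             # run servers at 70% of capacity at peak

servers = peak_rps / (per_server_rps * headroom)     # ~12.4
print(f"need ~{servers:.0f} app servers, plus replicas per region")
```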
1.12.2.8 Uncomfortable Real-World
Questions
How many application servers are needed to support the customer
base?
What is the optimal ratio of users to web servers?
What is the maximum number of users per server?
What is the maximum number of transactions per server?
Which specific hardware configurations provide the best
performance?
What is the current production server capability?
What do the users do? (These are business process definitions.)
How fast do the users do it? What are the transaction rates of each
business process?
When do they do it? What time of day are most users using it?
What major geographic locations are they doing it from?
How many connections can the server handle?
How many open file descriptors or handles is the server configured
to handle?
How many processes or threads is the server configured to
handle?
Does it release and renew threads and connections correctly?
How large is the server’s listen queue?
What is the server’s “page push” capacity?
What type of caching is done?
1.12.3 Architecture Fields
1.12.3.1 Overview
Information Architecture
Distribution Architecture
System Architecture
Physical Architecture
Architectural Validation
1.12.3.2 Information Architecture
(to analyze Caching)
Defines the pieces of information to aggregate or integrate and how
each fragment changes:

| Data | Changed by time | Changed by personalization |
| --- | --- | --- |
| Country codes | No (not often, reference data) | No |
| News | Yes (aging only) | No, but personal selections |
| Greeting | No | Yes |
| Message | Yes (slowly aging) | Yes |
1.12.3.3 Distribution
Architecture
Tells the portal how to map/locate the fragments defined in the
information architecture:

| Data Type | Source | Protocol | Port | Avg. Resp. | Worst Resp. | Downtimes | Max Conn. | Loadbal. | Security | Contact/SLA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| News | hostX | http/xml | 3000 | 100ms | 6 sec. | 17.00-17.20 | 100 | client | plain | Mrs.X/News-SLA |
| Research | hostY | RMI | 80 | 50ms | 500ms | 0.00-1.00 | 50 | server | SSL | Mr.Y/res-SLA |
Additional factors to consider:
Available bandwidth
Number of planned requests
Distance to the device
Availability numbers
Example results:
Back-end server performance affecting home page construction
time
Huge latencies and variations in back-end response times cause
instabilities in the portal application
Contacting the back-end servers for every request is therefore not
viable.
1.12.3.4 Service Access Layer
Is determined by the distribution architecture. It should:
Handle changes in back-end interfaces
Monitor back-end system connections
Disable connections that are not functioning properly (“fail
fast”)
Add new sources to the system
Poll and re-enable sources that have been temporarily disabled
Keep track of statistics on all sources.
Simple Alternative: Sidecar, contains circuit
breaker & service discovery
Advanced Alternative: Service mesh with separate
data and control plane
1.12.3.5 Physical and Process
Architecture
Physical Architecture:
Deals with reliability issues (replication, high-availability, etc.)
and scalability (horizontal and/or vertical)
Need to define scalability methods from the beginning due to their
impact on overall system architecture
Horizontally scalable application:
Replicated on multiple hosts
Avoids single point of failure
Vertically scalable application:
Can only install more CPUs or RAM on single instance of host
Limited scalability and availability (HA application)
1.12.3.6 Architecture
Validation
In the architecture validation phase these questions are answered:
How does the architecture …
Handle security and privacy?
Handle data consistency and durability?
Handle disaster recovery and business continuity?
Handle performance, scalability and capacity?
Handle integration with other systems and data sources?
Handle upgrades, maintenance and support?
Align with the organization’s goals, strategies and plans?
1.12.4 Improving an Existing
Design
1.12.4.1 Problems with Naive
Application Designs
The example portal used no caching or other optimizations:
GUI architecture: Long time with empty page
System architecture: No System Architecture
Diagram!
Performance: Very slow construction of
home-page
Reliability: Frequent stalls and crashes of the
application
Throughput: 10 users max. with top-notch
hardware!
Team: Little understanding of performance or
architecture
1.12.4.2 Improving the GUI
Architecture
Minimize HTTP requests by combining files and using sprites
Use asynchronous loading for non-critical resources
Lazy load images, videos, and other heavy resources
Avoid heavy use of animations, dynamic effects, and large blocking
scripts
Preload essential resources
Use a content delivery network (CDN) for static resources
Use browser caching effectively and implement caching strategies on
both the server and the client side
Minimize the use of plugins and third-party scripts
Minimize the number of DOM manipulations
Consider server-side rendering or progressive web apps (PWAs)
Monitor and optimize page load time regularly, using performance
optimization tools and browser dev-tools to identify bottlenecks.
1.12.4.3 Improving the System
Architecture
Use a system architecture diagram:
Show the main components and their interactions
Indicate the flow of data and processing
Highlight the main functionalities and relationships between
components
Reflect the overall structure and design of the system
Provide a clear and concise representation of the system
architecture.
1.12.4.4 Lessons Learned from Naive
Portal Designs
Request time is affected by the sum of individual
calls
Each delay in a call contributes
to the runtime
Back-end server problems lead to poor request time
Long timeout settings negatively impact response
times
Small improvements in sub-requests matter with many concurrent
requests
Lack of central architecture diagram limits
understanding of performance impact and throughput
JEE restrictions prevent creating custom
threads.
1.12.5 Fan-Out Architecture
1.12.5.1 Overview
Calls are parallel instead of serial.
The overall request time is determined by the slowest
sub-request
Each delay in an individual call adds to the
runtime
Long timeout settings negatively impact response times
Using short timeouts for back-end server calls is
recommended
Running short requests in separate threads may not be productive,
consider request bundling
Error from one sub-call should not block the whole
request, have a fallback
Avoid all threads getting stuck on a dysfunctional
sub-call (bulkhead)
Temporarily close dead connections
(circuit-breaker).
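A sketch combining parallel fan-out, short timeouts, and fallbacks; the two fetch functions stand in for real back-end calls:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_news():      # stand-in for a real back-end call
    time.sleep(0.05); return ["headline 1", "headline 2"]

def fetch_greeting():  # stand-in for a real back-end call
    time.sleep(0.01); return "Hello, Alice"

pool = ThreadPoolExecutor(max_workers=16)  # bounded pool acts as a bulkhead

def build_page():
    # Fan out in parallel: page time ~ the slowest sub-call, not the sum.
    news_f, greet_f = pool.submit(fetch_news), pool.submit(fetch_greeting)
    page = {}
    try:
        page["news"] = news_f.result(timeout=0.2)  # short timeout: fail fast
    except Exception:
        page["news"] = []                          # fallback: don't block the page
    try:
        page["greeting"] = greet_f.result(timeout=0.1)
    except Exception:
        page["greeting"] = "Hello"                 # stubbed default value
    return page

print(build_page())
```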
1.12.5.2 Reliability Issues in
Dependencies
System load becomes worse due to hanging requests
occupying resources and leading to heavy garbage collection
Dead servers can cause a buildup of threads even with
short timeouts
The portal was frequently impacted by failing back-end
servers
Avoid lengthy waits for sub-requests in the
homepage action handler: adopt the fail-fast pattern.
1.12.5.3 Fragments
Pages: Unique to customers, cannot be re-used
Page fragments:
Can be shared and heavily re-used
Allows huge reduction in back-end requests
Downside: If fragments change, mechanism needed to invalidate
dependent pages.
1.12.5.4 Latency Reduction &
Tolerance
Keep response times tight but aware of
stragglers
Fight stragglers with backup requests and
cross-server cancellation
Watch for overload at sender when responses come
back
Do NOT smear background load evenly over time; synchronize
it across machines instead
Reduce head-of-line blocking (partition large
requests)
Partition data across machines
Cheat by coming back with partial data
Cross request adaptation
Increase replication count
Beware of the incast problem
1.12.5.5 Avoiding Getting
Stuck
Fail Fast: Don’t wait for problematic
resources
Timeouts: Use timeouts when accessing a
service
Exponentially Decreasing Retries: Use if
needed
Fallback: Use alternatives when service doesn’t
work, such as serving stale data
Caching: Retrieve data from cache if real-time
dependency is unavailable, even if data is stale
Eventual Consistency: Queue writes to be persisted
once dependency is available
Stubbed Data: Revert to default values if
personalized options can’t be retrieved
Empty Response (Fail Silent): Return null or empty
list that UIs can ignore.
1.12.6 Containing Failures
1.12.6.1 Circuit Breakers
Purpose: Handle faults that might take a long time to
recover from
Provide control mechanism to prevent application from
continually trying to perform a failing operation
Allows application to fail fast and respond to failures
quickly
Acts as a switch that “trips” when system detects a failure
Stops application from making further attempts to perform operation
until reset
Helps prevent application from becoming
unresponsive
Protects other parts of the system from being
affected by the failure.
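A minimal circuit-breaker sketch along these lines; the thresholds are hypothetical, and a production implementation would add thread safety and error classification:

```python
import time

class CircuitBreaker:
    """Trips open after repeated failures; retries after a cool-down."""
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args):
        if self.opened_at is not None:                  # circuit is open
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None                       # half-open: try again
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()       # trip the switch
            raise
        self.failures = 0                               # success resets the count
        return result
```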
1.12.6.2 Bulkheads
Bulkhead pattern is a design for fault-tolerant
applications
Elements of an application are isolated into
pools
If one pool fails, others will continue to
function
Named after the sectioned partitions (bulkheads) of a ship’s
hull
Example: semaphores and thread pools
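A semaphore-based bulkhead sketch, as mentioned above; the pool names and sizes are hypothetical:

```python
import threading

# One bounded pool (here: a semaphore) per downstream dependency, so a
# slow "recommendations" service cannot exhaust threads needed elsewhere.
bulkheads = {
    "recommendations": threading.BoundedSemaphore(10),
    "orders":          threading.BoundedSemaphore(50),
}

def call_dependency(name, fn):
    sem = bulkheads[name]
    if not sem.acquire(timeout=0.05):   # pool exhausted: fail fast
        raise RuntimeError(f"bulkhead for {name} is full")
    try:
        return fn()
    finally:
        sem.release()
```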
1.12.6.3 Blast Reduction
Partition the app into geographical regions (e.g. US, DACH, etc.)
Split regions further into specific availability zones and
cells
Use shuffle sharding to provide single-tenant-like isolation for
shared workloads
Split the app itself into separate control and data planes