<p><strong>Historical Modeling</strong>: An analysis and design technique for building distributed systems. By Michael L Perry.</p>
<h1>The New Architecture</h1>
<p><em>2018-05-13</em></p>
<p>In <a href="/journey/2018/04/28/pluralsight-to-production.html">the first installment</a> of The Learning Journey, I described a system. The original architecture of that system appears in Figure 1. The new architecture is in Figure 2.</p>
<p><img src="/images/ep02_fig01.png" alt="The old architecture" /></p>
<p>Fig. 1: The old architecture, which came with several issues that I explained in episode 1.</p>
<p><img src="/images/ep02_fig02.png" alt="The new architecture" /></p>
<p>Fig.2: The new architecture.</p>
<p>Comparing Fig. 1 and Fig. 2, let’s look at the differences between the old and the new architecture and, more importantly, at their consequences.</p>
<p>The fundamental change in the new architecture is that our data structure and storage follow the Historical Modeling concept. This decision has some very important consequences:</p>
<ul>
<li>Because of immutability, synchronizing the data becomes much easier. We no longer need any closed-source external technology or components, so we can get rid of the old and abandoned MS Sync Framework.</li>
<li>With the MS Sync Framework gone, the need for the SQL Server database (and possible licensing issues once we outgrow the express edition) is gone as well.</li>
<li>The console application (the app for the dispatchers, remember) gets its own local database, which is synchronized with the rest of the system. Hence increased reliability, improved response time, and decreased server load; the console can even be used offline “on the road”.</li>
<li>Keeping history is no longer difficult; it is a natural consequence of using Historical Modeling. So if we want to keep a customer’s new address as well as the old one, no problem (except perhaps for the new GDPR legislation in Europe).</li>
<li>Decisions can be made even while offline. Both the workers and the dispatchers can safely make corrections to previous input, even when they are offline. If multiple people ever make conflicting corrections or changes, those conflicts can be resolved later by anyone at any time. And the fact that there was a conflict, and how it was resolved, will always be available for traceability.</li>
<li>The central database server with its application-specific schema can be replaced by an application-independent message store/distributor. Hence very loose coupling, which should allow us to upgrade the system relatively easily.</li>
</ul>
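<p>The first consequence above is worth illustrating. Because facts are immutable and never overwritten, two-way synchronization reduces to a set union: each node simply acquires the facts the other already has, and there is nothing to merge. A minimal sketch (the fact tuples and node names are illustrative, not taken from the actual system):</p>

```python
# Minimal sketch: with immutable facts, two-way sync is just a set union.
# Fact contents never change, so there are no update conflicts to merge.

def sync(node_a, node_b):
    """Exchange missing facts between two nodes; both end up with the union."""
    merged = node_a | node_b
    return merged, merged

# Hypothetical facts recorded independently at each node.
dispatcher = {("task", 1, "assigned"), ("task", 2, "assigned")}
worker = {("task", 1, "assigned"), ("task", 1, "completed")}

dispatcher, worker = sync(dispatcher, worker)
assert dispatcher == worker  # every node converges on the same history
```

Contrast this with state-based sync, where the same row can be edited on both sides and some component has to decide which version wins.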
<p>All my wishes fulfilled? Not yet. The old architecture had one very important shortcoming that the new architecture doesn’t solve (yet). I even forgot to mention it in the first episode, but reality reminded me about it again last week. We have a single point of failure: the central database server in the old architecture, and the distributor in the new one. It would be nice if we could have a continuous backup of the messages on the distributor. Better still would be a cold standby distributor. And the ultimate would be a hot standby distributor in a different data centre. I have been thinking about this for a while now: instead of keeping the messages on the client’s queue until the distributor has saved them, we could keep them on the client’s queue until at least two distributors have saved them. However, the bookmark-related counter on each distributor complicates things. I think we need Michael’s input on this one… Michael, HELP.</p>
<p><strong>Note from Michael</strong>: Not to worry. This problem has solutions. Let’s discuss some of the options, and I’ll provide documentation for the readers.</p>
<p>So the basic architecture is clear; now it’s time for the real work. In the next episode we’ll start building our historical model.</p>
<p>Want to prepare? Have a look at <a href="https://www.youtube.com/watch?v=NW0-gXAoPG4">How Not to Destroy Data</a>. For those who have a Pluralsight subscription and finished <a href="https://app.pluralsight.com/library/courses/occasionally-connected-windows-mobile-apps-collaboration">the Collaboration course</a>, <a href="https://www.pluralsight.com/courses/occasionally-connected-windows-mobile-apps-lob">Occasionally Connected Windows Mobile Apps: Enterprise LOB</a> might be of interest to you.</p>
<p><em>Posted by Jan</em></p>
<h1>Historical Modeling: from Pluralsight to Production</h1>
<p><em>2018-04-28</em></p>
<p>A couple of years ago I developed a mobile Android application for field workers. For dispatching I made a Windows application (the console) that allowed the dispatchers to plan tasks and send them to the workers’ mobile phones. The workers could view their planning and register their activities on their phones, including parts and materials, which were then synchronised back to the dispatcher for further processing and eventually for invoicing.</p>
<p>This setup was built around a central SQL Server Express database running on a hosted server. The console (the dispatcher’s Windows application) accessed this SQL database directly over the internet. The workers’ phones had their own local SQLite database, which synchronised with the SQL Server database over a 3G/4G internet connection. This synchronisation was built on Microsoft’s <a href="https://msdn.microsoft.com/en-us/library/bb902854(v=sql.110).aspx">Sync Framework</a> toolkit in combination with an old version of the open-sourced <a href="https://github.com/SelvinPL/SyncFrameworkAndroid">SyncFrameworkAndroid</a> made by <a href="https://github.com/SelvinPL">Selvin</a>.</p>
<p>The system is running fine in production with several customers. However, as the customer base increases and customers keep asking for enhancements, I’m running into trouble with this setup.</p>
<h3 id="a-few-of-the-problemslimits-i-encountered">A few of the problems/limits I encountered</h3>
<ul>
<li>Although the Sync Framework Toolkit v4.0 was open sourced by Microsoft, the Sync Framework core library, which contains the magic, is still closed source and Microsoft seems to have abandoned the project.</li>
<li>The Sync Framework synchronises state, which makes the synchronisation process rather complex, with a lot of edge cases.</li>
<li>Every schema change must be implemented on both the server and the apps. This results in very tight coupling, and hence a very difficult process each time I want to ship a new feature that requires a database schema change.</li>
<li>The workers register their activities on their mobile phone by simply pushing the corresponding icon (work, transport, pause, private, …) the moment they start the activity. However, sometimes they forget to push, or they push the wrong icon, and hence they have to correct the registration afterwards. For traceability, this correction should not override the original, so we have to keep history. Did someone mention “temporal databases”? Temporal databases are relatively new and relatively complicated. What if we combine that with the complexity of state-based synchronisation? I don’t even want to think about it!</li>
<li>Workers can apply minor corrections themselves; however, if the correction is more complicated, it might be more efficient to call the dispatcher and let him handle it on his big screen. But what happens if the worker and the dispatcher both apply conflicting corrections while the mobile phone is offline? Indeed: conflicts!</li>
<li>So far, for lookup tables (customers, materials, locations, …), if the data changes, we overwrite the current version with the new one. I know a lot of systems work that way, but if we want to do it correctly and avoid nasty surprises, we have to keep history.</li>
<li>Having the console access the central SQL Server database directly has two major negative consequences: 1. The console needs a reliable internet connection, so mobile usage is not possible. 2. The console puts a heavy burden on the server and on the internet connection, so scalability is limited.</li>
<li>SQL Server Express is free, but it is limited (e.g. in database size). Once your system outgrows these constraints, you’re in for a hefty license fee.</li>
</ul>
<p>These are some of the issues I want solved. In the next episode we will work out a new architecture based on Historical Modeling and see how it can improve the system.</p>
<p>In the meantime, if you want a short introduction to Historical Modeling, start with the video <a href="https://www.youtube.com/watch?v=ptVJTrJ8mQE">What is Historical Modeling</a>. If you prefer a deep dive, I recommend you start with Michael’s course at Pluralsight: <a href="https://app.pluralsight.com/library/courses/occasionally-connected-windows-mobile-apps-collaboration">Occasionally Connected Windows Mobile Apps: Collaboration</a>.</p>
<p><em>Posted by Jan</em></p>
<h1>The CAP Theorem</h1>
<p><em>2011-02-08</em></p>
<p>Eric Brewer is an expert in distributed systems. In his keynote address at the Principles of Distributed Computing conference in 2000, he gave us the CAP Theorem. It states that a distributed system cannot simultaneously guarantee these three attributes:</p>
<ul>
<li>Consistency</li>
<li>Availability</li>
<li>Partition Tolerance</li>
</ul>
<p>It can guarantee at most two. Which two you choose should depend upon the system’s architectural requirements.</p>
<h2 id="consistency">Consistency</h2>
<p><img src="/images/consistency.png" alt="Consistency" /></p>
<p>Consistency in a distributed system is not strictly the same as ACID consistency. A distributed system is consistent if a read at any node returns data that is no older than that written by a previous write. The read and the write may occur at the same node or at different nodes. The nodes may use any algorithm they wish to keep each other up-to-date. But if I write version 2, a consistent system will never again read version 1.</p>
<p>There are many ways to guarantee consistency. One would be to block writes until all nodes have been notified. Another would be to block reads until all nodes are consulted. Yet another is to designate one node as the master of that particular piece of data, and route all messages to it. In practice, consistent distributed systems use a combination of these algorithms.</p>
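<p>The third approach can be sketched in a few lines. This is a toy, not any particular product: the routing function and node names are made up, and every read and write for a key is simply directed to that key’s master, so a read can never return stale data.</p>

```python
# Sketch of master delegation: one node is the master for a given key,
# and all reads and writes for that key are routed to it.

class Cluster:
    def __init__(self, node_names):
        self.stores = {name: {} for name in node_names}
        self.nodes = sorted(node_names)

    def master_for(self, key):
        # Deterministic routing: within this process, every call agrees
        # on which node owns the key.
        return self.nodes[hash(key) % len(self.nodes)]

    def write(self, key, value):
        self.stores[self.master_for(key)][key] = value

    def read(self, key):
        # Reading from the master can never return a stale version.
        return self.stores[self.master_for(key)].get(key)

cluster = Cluster(["a", "b", "c"])
cluster.write("customer:42", "version 2")
assert cluster.read("customer:42") == "version 2"
```

The price, as the proof below shows, is availability: if the message to the master is lost, the operation must fail.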
<h2 id="availability">Availability</h2>
<p><img src="/images/availability.png" alt="Availability" /></p>
<p>A distributed system is available if any non-failing node responds to a request in a reasonable amount of time. It doesn’t mean that nodes can’t fail. It just means that, whatever other guarantees the system offers, it will respond when you address one of the remaining nodes.</p>
<p>You can see the tension between consistency and availability. To guarantee both, we need redundancy. What if the data that we try to read was only stored on the node that was lost after the write? That data would not be available, and we could not guarantee consistency.</p>
<h2 id="partition-tolerance">Partition Tolerance</h2>
<p><img src="/images/partition_tolerance.png" alt="Partition tolerance" /></p>
<p>A distributed system is partition tolerant if it can tolerate the loss of any number of messages. If enough messages are lost between islands of nodes, the network has been partitioned.</p>
<p>Network partitioning happens most often in wide area networks. A client disconnects from the internet. Or the connection between two data centers is severed. But network partitioning can happen in a local area network. No network, no matter how expensive, can guarantee that all packets are delivered. It is up to the designer of the distributed system to decide whether the system will tolerate message loss.</p>
<p>Most distributed systems respond to momentary message loss by resending messages. But since the network cannot guarantee message delivery, even those retries might be lost. It’s how the system responds to maintained message loss that determines whether it can guarantee partition tolerance.</p>
<h2 id="proof">Proof</h2>
<p>The <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.6951&rep=rep1&type=pdf">proof of the CAP Theorem</a> is pretty simple. Informally, we can ask: how can a system guarantee both consistency and partition tolerance? If all messages between two given nodes are lost, how can a write at one affect a read at the other? No matter what algorithm you come up with, the only way to guarantee both consistency and partition tolerance is to give up availability. When messages between two nodes are lost, you must fail the read as if the second node was down. There is no way to respond with consistent data.</p>
<p>Or you can ask how a system can guarantee both consistency and availability. Remember our three example algorithms. If we block writes until every available node is notified, and those notifications are lost, then we must fail the write. If we block reads until every available node is consulted, and those messages are lost, we must fail the read. And if we delegate to the master, and that message is lost, then we fail as well.</p>
<p>And finally, we can guarantee both availability and partition tolerance, but we have to relax our consistency guarantee. Nodes might be down and messages might be lost, but if we assume that those problems will eventually be solved, then we can say that we will eventually be consistent. It is possible to read stale data from such a system, but given enough time all nodes will be up-to-date.</p>
<p>When designing a distributed system, consider the guarantees that the problem demands. But also consider the guarantees that you will be unable to make, and decide how best to respond in those situations.</p>
<h2 id="eventual-consistency">Eventual consistency</h2>
<p>Most of the modern distributed systems frameworks have opted to relax the consistency guarantee. They instead promise “eventual consistency”, or that you will read the new value if you wait long enough. More formally, this guarantee states:</p>
<ul>
<li>All writes are durable.</li>
<li>Once a version has been read, a later read will not return an earlier version.</li>
</ul>
<p>All writes are durable. No data will be lost. Data might, however, be overwritten by a later write.</p>
<p>Once a write becomes visible at a particular node, that node will no longer return an earlier version. It will never go back in time to a state before that write completed.</p>
<p><img src="/images/eventual_consistency.png" alt="Eventual consistency" /></p>
<p>Consider a timeline of writes. We collect that timeline fully ordered at the node where the writes occurred. Those writes produce a series of state changes. By observing the state at that node, we can detect whether a write took place.</p>
<p>Now, allow that stream of writes to move to another node. It causes a similar series of state changes there. If we observe the state at the second node, it might be earlier than the state at the first. Consistency is not guaranteed. But, once the second node catches up, it will not go back. It is eventually consistent.</p>
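<p>That monotonic behavior can be sketched directly (the version numbers and the replica shape are illustrative): a replica applying the source’s write stream in order may lag behind, but its observed state only ever moves forward.</p>

```python
# Sketch: a replica applies the source's write stream in order. It may lag
# behind (reads can be stale), but its observed version never moves backwards.

class Replica:
    def __init__(self):
        self.version = 0

    def apply(self, version):
        # Applying the stream in order means the version only increases.
        self.version = max(self.version, version)

source_writes = [1, 2, 3, 4]
replica = Replica()
observed = []

for v in source_writes[:2]:       # only part of the stream has arrived
    replica.apply(v)
observed.append(replica.version)  # stale read: version 2, not 4

for v in source_writes[2:]:       # the rest arrives eventually
    replica.apply(v)
observed.append(replica.version)

assert observed == [2, 4]  # stale, then caught up; never went back in time
```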
<h2 id="historical-modeling">Historical Modeling</h2>
<p>Like many other frameworks, Historical Modeling guarantees eventual consistency. It does so by transmitting historical facts from one node to another.</p>
<p>A historical fact is a record of a decision or state change that occurred at one node. All writes in a historical system are creations of new facts. Facts are never modified or destroyed.</p>
<p>If you observe the history of facts at a target node, you might find that the fact you just wrote at the source is not yet there. Consistency is not guaranteed. This allows the target node to remain available even if the network that it shares with the source is partitioned.</p>
<p><img src="/images/historical_eventual_consistency.png" alt="Historical eventual consistency" /></p>
<p>When the fact is eventually shared with the target node, the transmission includes its fields and predecessors. These fields and predecessors uniquely identify the fact and distinguish it from others. In this way, an observer can recognize the target fact as the same as the source.</p>
<p>Predecessor facts must be transmitted first, so predecessors will always be present. Successors, however, will arrive eventually. Once they do, they will never be deleted. An observer can query for successors to determine the current state of the system. He will find that the state will never go backwards to a time when the new facts did not exist.</p>
<p><em>Posted by Michael L Perry</em></p>
<h1>Guarantees</h1>
<p><em>2011-01-28</em></p>
<p>A distributed system based on durable message queues relies upon three guarantees:</p>
<ul>
<li>Messages will be <strong>delivered</strong> at least once.</li>
<li>Message <strong>duplication</strong> will have no ill effects.</li>
<li>Messages will be delivered in the <strong>order</strong> that they were sent.</li>
</ul>
<p>As it turns out, these guarantees are difficult to ensure. Different distributed systems architectures have different strategies for upholding these guarantees.</p>
<h2 id="delivery">Delivery</h2>
<p>In a distributed system based on message queues, at-least-once delivery is the easiest guarantee. The queue itself is durable. Once a message is queued, the sender relinquishes responsibility. Assuming that the recipient will eventually read from the queue (which our operations team monitors), the message will get delivered.</p>
<h2 id="duplication">Duplication</h2>
<p>Duplication is a little trickier. It usually happens when the recipient has some trouble processing the message.</p>
<p>Queues don’t protect us from duplication. After the recipient processes a message, it removes it from the queue. But if it can’t finish its task, then it has to leave the message on the queue. Depending upon how much of its task was completed, some or all of the work may be duplicated the next time the recipient pulls the message.</p>
<p>One solution to duplication is idempotency. An <strong>idempotent</strong> message is one that will have the same outcome no matter how many times it is processed (assuming no other related messages intervene). Changing a customer’s phone number is idempotent. Charging their credit card is not. Some messages can be designed to be idempotent, but not all.</p>
<p>To protect against duplication of non-idempotent messages, we have two strategies: journaling and transactions. A <strong>journaling</strong> strategy involves keeping track of the steps that have already been completed. We check the journal to ensure that we don’t repeat those steps. A transaction-based strategy involves doing the work in the same transaction as the queue. A <strong>distributed transaction coordinator</strong> (DTC) ensures that both removing the message and completing the work happen as an atomic unit.</p>
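<p>The journaling strategy can be sketched in a few lines. The message shape and ids here are made up for illustration; the point is that recording completed message ids makes a non-idempotent handler safe under at-least-once delivery.</p>

```python
# Sketch of journaling: record each completed message id and skip any
# redelivery, so the non-idempotent side effect happens exactly once.

charges = []    # the non-idempotent side effect (charging a card)
journal = set() # ids of messages already processed

def charge_card(message):
    if message["id"] in journal:
        return  # duplicate delivery: the work was already done
    charges.append(message["amount"])  # do the work
    journal.add(message["id"])         # record completion

msg = {"id": "order-7", "amount": 100}
charge_card(msg)
charge_card(msg)  # at-least-once delivery hands us the message again

assert charges == [100]  # the card was only charged once
```

In a real system the journal write and the side effect would themselves need to be atomic, which is exactly where the transaction-based strategy comes in.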
<h2 id="order">Order</h2>
<p>Order of delivery is the most difficult guarantee to uphold. One reason is poison messages.</p>
<p>When a recipient fails to process a message, it must leave the message on the queue. Otherwise messages will get lost (see the delivery guarantee). Most failures are <strong>transient</strong>, meaning that they are caused by temporary conditions, and might work if tried again. Deadlocks and timeouts are examples of transient failures. But some failures are intrinsic to the message itself. These <strong>poison messages</strong> will not succeed if retried. If they are left at the top of the queue, they will prevent later messages from being processed.</p>
<p>To detect a poison message, a service typically retries a specific number of times. Once that threshold is exceeded, the message is considered poison. The typical strategy for dealing with poison messages is to move them to a different queue. System operators monitor the poison message queue (also known as a <strong>dead letter queue</strong>) and intervene when messages arrive. They take whatever actions are necessary to ensure that the messages succeed, and then put them back on the application queue.</p>
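<p>The retry-threshold mechanism can be sketched as follows. The handler and the limit of three are illustrative, and this sketch requeues a failed message at the back rather than leaving it at the top, so good messages behind it are not blocked:</p>

```python
from collections import deque

# Sketch: retry each message up to a threshold; past that, treat it as
# poison and move it to a dead-letter queue for operator intervention.

MAX_RETRIES = 3

def drain(queue, handler):
    dead_letters = []
    attempts = {}
    while queue:
        message = queue.popleft()
        try:
            handler(message)
        except Exception:
            attempts[message] = attempts.get(message, 0) + 1
            if attempts[message] >= MAX_RETRIES:
                dead_letters.append(message)  # poison: set it aside
            else:
                queue.append(message)         # retry later, from the back
    return dead_letters

def handler(message):
    if message == "poison":
        raise ValueError("cannot process")

queue = deque(["ok-1", "poison", "ok-2"])
dead = drain(queue, handler)
assert dead == ["poison"]  # the good messages got through
```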
<p>While this strategy allows the system to continue functioning, it changes the order in which the messages are processed. If a later message depends upon a poison message, then it will be processed in the wrong order. In the best case, the system detects the dependency and treats the later message as poison as well. In the worst case, results are undefined.</p>
<p><strong>Parallelism</strong> can also cause messages to be processed out of order. If multiple nodes pull work from a single queue, there is no guarantee that they will finish that work in the same order that they started. If one of the nodes experiences a transient failure, the problem is exacerbated. While it is working on one message, other nodes will pull messages from further down the queue. If the first message fails, the service will put it back on the top to be processed later.</p>
<p>The most flexible solution to the ordering problem is to ensure that order between messages does not matter. Like idempotency, this can be achieved with most messages, but not with all. When order matters, the system must be programmed to recognize when it is violated. It can then move the later message to the bottom of the queue, thus increasing the likelihood that its prerequisite will be processed first.</p>
<p>Historical Modeling provides the three guarantees in the following ways:</p>
<ul>
<li>A fact is both <strong>data and message</strong>.</li>
<li><strong>Identity</strong> is determined by state.</li>
<li>Predecessors define a <strong>partial order</strong> among facts.</li>
</ul>
<h2 id="data-and-message">Data and message</h2>
<p>Most architectures keep messages separate from the data. RPC messages are just wire protocols. Service busses store messages in queues, not in the database. Brokers persist the state of workflows separately, distinct from the data that those workflows operate on.</p>
<p>Historical Modeling, on the other hand, stores both data and message in facts. Facts store data, and can be queried to find the current state of an entity. Facts also represent messages, and can be queried to find work. When the user performs an action, a fact is stored. This implicitly sends the message.</p>
<p>Some distributed systems architectures rely upon a DTC to ensure that a message is only processed once. If the message fails, then the DTC rolls back both the removal of the message and the database update. But when the repository is the queue, a DTC is not necessary. Handling the message adds the fact that implicitly removes the message from the queue.</p>
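<p>The repository-as-queue idea can be sketched with made-up order and invoice identifiers: the “queue” is simply a query for facts that have no successor yet, and handling a message means adding a fact, which removes it from the query result.</p>

```python
# Sketch: when the repository is the queue, "pending work" is just a query.
# An order stays in the queue exactly as long as it has no Invoice successor.

orders = {"order-1", "order-2", "order-3"}
invoices = {("invoice-a", "order-1")}  # each invoice fact references its order

def pending_orders():
    invoiced = {order for _, order in invoices}
    return orders - invoiced

assert pending_orders() == {"order-2", "order-3"}

# Handling an order adds a fact; no separate queue removal, no DTC needed.
invoices.add(("invoice-b", "order-2"))
assert pending_orders() == {"order-3"}
```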
<h2 id="identity">Identity</h2>
<p>A historical model takes advantage of immutability to protect against duplication. A historical fact cannot be modified. That immutable state identifies the fact. Any other fact with the same state <strong>is the same fact</strong>.</p>
<p>Suppose that a Stock fact has only one field: symbol. Any Stock where symbol=MSFT is the same fact. If we record a related fact (for example a Purchase by Account(12345) of 300 shares of Stock(MSFT) at 3:49pm on 1/28), then that related fact is also uniquely identified by its collection of fields. Another fact with exactly the same fields will be considered the same fact. It will not be duplicated.</p>
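<p>One way to sketch content-derived identity is to hash a fact’s type and fields. The <code>fact_id</code> helper here is hypothetical, not part of any real Historical Modeling library, but it shows why a retransmitted fact cannot create a duplicate:</p>

```python
import hashlib

# Sketch: a fact's identity is derived from its fields, so receiving the
# same fact twice yields the same identity and no duplicate record.

def fact_id(fact_type, **fields):
    canonical = fact_type + "|" + "|".join(
        f"{k}={v}" for k, v in sorted(fields.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()

store = {}

def receive(fact_type, **fields):
    store[fact_id(fact_type, **fields)] = fields  # same state, same key

receive("Stock", symbol="MSFT")
receive("Stock", symbol="MSFT")  # duplicate transmission
receive("Stock", symbol="AAPL")

assert len(store) == 2  # the duplicate MSFT fact collapsed into one record
```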
<h2 id="partial-order">Partial order</h2>
<p>Queuing systems attempt to impose a full order among messages. The queue itself does not know when that order is important and when it is not. As a result, it tries to uphold the order guarantee equally in all cases. A full order among messages is over-constrained.</p>
<p>Instead of trying to achieve full order, a historical model defines a partial order among messages. Each message has a reference to its predecessors. These messages must be sent first. The infrastructure knows this, and preserves order when necessary. It does not leave detection up to the application.</p>
<p>On the flip side, the infrastructure also understands when messages are unrelated. In those cases, it can freely violate the order guarantee. The application will not be adversely affected when unrelated messages are processed out-of-order.</p>
<p>A fact only references its direct predecessors. It does not directly reference all facts that were a part of the conversation. Nevertheless, the identity of a fact is dependent upon the identities of its predecessors. And that relationship is transitive. To understand the identity of a fact, a node must receive all of its direct and indirect predecessors. In this way, the predecessor relationships among the facts place them in the correct partial order relative to one another.</p>
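<p>The predecessor-first transmission rule amounts to a topological walk over the fact graph. A sketch, using a made-up set of fact types and predecessor references:</p>

```python
# Sketch: facts reference their predecessors, and a node transmits a fact
# only after all of its predecessors, yielding a correct partial order.

predecessors = {
    "Company": [],
    "Order": ["Company"],
    "Invoice": ["Order"],
    "Shipment": ["Order"],
}

def transmission_order(facts):
    sent, order = set(), []

    def send(fact):
        if fact in sent:
            return
        for p in predecessors[fact]:
            send(p)          # predecessors always go first
        sent.add(fact)
        order.append(fact)

    for fact in facts:
        send(fact)
    return order

order = transmission_order(["Invoice", "Shipment"])
assert order.index("Company") < order.index("Order") < order.index("Invoice")
```

Note that Invoice and Shipment are unrelated to each other, so either may arrive first; only the predecessor chain is constrained.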
<p>The queuing guarantees that distributed systems rely upon are not easy to achieve. Each architecture has its own mechanism for upholding them. Historical Modeling is no exception. It determines which guarantees are truly important to the correct operation of the system, and upholds them only when necessary.</p>
<p><em>Posted by Michael L Perry</em></p>
<h1>Service Bus</h1>
<p><em>2011-01-14</em></p>
<p>Message queues can improve the reliability and scalability of a distributed system when carefully applied. They solve the RPC problems of synchronous and unreliable messaging. However, they do not solve the problem of one-to-one coupling and configuration: the sender knows which queue to push messages to, and the recipient knows which queue to pull messages from. Service busses solve that problem.</p>
<p>A service bus is not a single monolithic component. Nor is it an infrastructure stack running on a cluster of machines (for example, BizTalk). Those are brokers. A service bus is a logical relationship among distinct services running on different machines, known as <strong>nodes</strong>. Often, those services use the same framework as one another, but that framework is installed individually at each node. Some popular service bus frameworks for .NET are:</p>
<ul>
<li><a href="http://www.nservicebus.com/">NServiceBus</a></li>
<li><a href="http://ayende.com/Blog/archive/2008/12/17/rhino-service-bus.aspx">Rhino Service Bus</a></li>
<li><a href="http://masstransit-project.com/">MassTransit</a></li>
</ul>
<h2 id="messages">Messages</h2>
<p>The most important task in designing a distributed system is defining the right set of <strong>messages</strong>. A message is an immutable block of information. It contains all of the information needed for the recipient to do something meaningful. Messages typically represent one of two things:</p>
<ul>
<li><strong>Command</strong> – a request for the system to take action</li>
<li><strong>Event</strong> – a notification that something has occurred related to the business domain</li>
</ul>
<p>Commands generally flow from the user of the system. A command might be “Submit order”, or “Accept payment”. Events, on the other hand, flow from one part of the system to another. An event might be “Order submitted”, “Order shipped”, or “Payment received”. Events are named with a past tense verb, while commands have an imperative verb.</p>
<p>A <strong>conversation</strong> is a series of related messages. The messages are all about the same thing (perhaps an order or a patient visit). The messages have a cause-and-effect relationship among them. A command message comes from the user and kicks off the conversation. Then the recipient of that command makes some decision, takes some action, and publishes one or more event messages. Other nodes respond to those events and the conversation continues. Conversations may end quickly, or they may continue for very long periods of time.</p>
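<p>The naming convention above can be sketched with hypothetical message shapes (these are illustrative types, not a real schema): the command carries an imperative verb and everything the handler needs, and the handler responds by publishing a past-tense event.</p>

```python
from dataclasses import dataclass

# Sketch: one step of a conversation, as a command (imperative verb)
# answered by an event (past-tense verb).

@dataclass(frozen=True)           # messages are immutable
class SubmitOrder:                # command: a request to take action
    order_id: str
    items: tuple

@dataclass(frozen=True)
class OrderSubmitted:             # event: notification that it occurred
    order_id: str

def handle_submit_order(command: SubmitOrder) -> OrderSubmitted:
    # The handler makes a decision, takes action, and publishes an event.
    return OrderSubmitted(order_id=command.order_id)

event = handle_submit_order(SubmitOrder(order_id="42", items=("widget",)))
assert isinstance(event, OrderSubmitted)
```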
<h2 id="handlers">Handlers</h2>
<p><img src="/images/handlers.png" alt="Command handlers" /></p>
<p>A service bus delegates message processing to <strong>handlers</strong>. A handler is a service running on a node that responds to a single kind of message.</p>
<p>There is typically only one handler for each type of command message. When the user issues a “submit order” command, one service is responsible for validating it and entering it into the database. Many nodes may be competing for that command, but only one will be selected to perform it.</p>
<p>On the other hand, there can be many handlers for event messages. Events in a problem domain have lots of side-effects. When an order is submitted, an invoice must be sent, items must be picked and shipped, and customer preference must be adjusted. Each of these side-effects is a separate handler for the “order submitted” event.</p>
<p>Handlers are not coupled to each other. They only know about the types of messages they consume and create. In fact, handlers are not even coupled to their queues. The service bus determines which queue a handler consumes messages from, and which other queues it posts messages to. This is all based on configuration.</p>
<h2 id="configuration">Configuration</h2>
<p>Like I said earlier, a service bus is not one monolithic thing. As such, there is no central configuration. Instead, each node is configured with the queue names and handlers that it needs to play its part in the system.</p>
<p>The user interface (typically a web server) is responsible for sending command messages to the proper handler. Therefore the UI node is configured with the name and location of the queue that each command handler pulls from.</p>
<p><img src="/images/submit_order_command.png" alt="SubmitOrder command" /></p>
<p>The command handler’s node is configured to pull messages from that same queue. But it is not configured with the names and locations of downstream queues. Remember that the command handler is creating event messages, and events have multiple handlers. If each command handler were configured with every downstream event handler, the operational overhead would be unbearable.</p>
<p><img src="/images/submit_order_handler.png" alt="SubmitOrder handler" /></p>
<p>Instead, event handlers <strong>subscribe</strong> to their messages. Command handlers <strong>publish</strong> messages. Subscribers tell the bus which messages they want to receive. The bus is configured with the names and locations of publishers of those messages. The bus registers with those publishers, so that the publishers know where to queue messages without the need of explicit configuration.</p>
<p><img src="/images/order_submitted_event.png" alt="OrderSubmitted event" /></p>
<p>To summarize, a service bus does two things:</p>
<ul>
<li>Decouples handlers from queues</li>
<li>Adds multicast</li>
</ul>
<p>Message queues are infrastructure components that must be explicitly provisioned. Without a service bus to route the messages, each service would have to know which queues to pull from and push to. Because a message is consumed as soon as it is processed, a single message queue does not support multicast. The service bus implements multicast patterns on top of queues.</p>
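<p>Both responsibilities can be sketched in a few lines. This is an in-memory toy, not any of the frameworks listed above: handlers register by message type with no queue names in sight, and publishing multicasts to every subscriber.</p>

```python
from collections import defaultdict

# Sketch: a service bus decouples handlers from queues and adds multicast.

class ServiceBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, message_type, handler):
        # Handlers know message types, not queue names or locations.
        self.subscribers[message_type].append(handler)

    def publish(self, message_type, payload):
        for handler in self.subscribers[message_type]:
            handler(payload)  # multicast: every subscriber gets a copy

bus = ServiceBus()
log = []
bus.subscribe("OrderSubmitted", lambda order: log.append(("invoice", order)))
bus.subscribe("OrderSubmitted", lambda order: log.append(("ship", order)))

bus.publish("OrderSubmitted", "order-1")
assert log == [("invoice", "order-1"), ("ship", "order-1")]
```

A real bus would back each subscription with a durable queue; the routing and multicast shape is the part this sketch shows.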
<h2 id="service-bus-in-a-historical-system">Service bus in a historical system</h2>
<p>In historical modeling, every fact is potentially a queue. This means that a queue is logical, not physical. No provisioning of infrastructure is required to set up a queue. While this doesn’t eliminate the need for a service bus, it does change the nature of that need.</p>
<p>In a historical model, a <strong>fact</strong> plays the part of a message. A fact is a historical record of a decision made either by a user or by the system. A fact can represent both a command and an event.</p>
<p>To begin a historical conversation, the user interface creates an initial fact. This fact acts as both a command and an event. This fact is typically named with a noun. The verb (“submit”, “process”, etc.) is implied. For example, the UI would create the “Order” fact.</p>
<p><img src="/images/order.gif" alt="Order fact" /></p>
<p>The Order is published to the Company. Any node that subscribes to the company will receive the Orders.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact Order {
    publish Company company;
    Customer customer;
    OrderLine* items;
</code></pre>
</div>
<p>A historical model acts as both message queue and database. As a result, it is not necessary to create a command handler to write the order to the database. It’s already there. Instead, we can focus on the event handlers. Each fact handler is responsible for one side-effect. For example, one handler will ship the order, and another will prepare an invoice. Each side-effect is itself a fact.</p>
<p><img src="/images/invoice_shipment.gif" alt="Invoice and Shipment facts" /></p>
<p>Instead of pulling facts from a physical queue, a handler runs a <strong>query</strong>. The query returns all facts to which the side-effect has not yet been applied. For example, the invoicing service processes all orders that have not yet been invoiced.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact Company {
    unique;
    Order* ordersPendingInvoice {
        Order o : o.company = this
        where o.isPendingInvoice
    }
}
</code></pre>
</div>
<p>When the handler completes its task, it adds the Invoice fact. Adding this fact saves the information to the historical database. But it also publishes the event for any downstream handlers (accounts receivable or collections, for example). Furthermore, it effectively removes the Order from the <code class="highlighter-rouge">ordersPendingInvoice</code> query. This is accomplished through the <code class="highlighter-rouge">isPendingInvoice</code> <strong>predicate</strong>.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact Order {
    publish Company company;
    Customer customer;
    OrderLine* items;
    bool isPendingInvoice {
        not exists Invoice i : i.order = this
    }
}
</code></pre>
</div>
<p>A service bus in a historical system does not need to route messages to the correct queues. Instead, a historical service bus invokes handlers based on logical subscriptions. Each handler subscribes to a root fact. When subsequent facts are published to that root, the handler is notified. The handler then executes the appropriate query and processes the facts. Its response removes the fact from the query.</p>
Michael L Perry

Message Queues
2011-01-10T18:00:00+00:00
http://historicalmodeling.com/distributed-systems/message-queues

Message queues can improve the reliability and scalability of a distributed system when carefully applied. They solve the RPC problems of synchronous and unreliable messaging. However, they do not solve the problem of one-to-one coupling and configuration. The sender knows which queue to push messages to, and the recipient knows which queue to pull messages from. Service busses solve that problem.

<p>For a distributed system to work, it has to move information from machine to machine. No single machine is responsible for the system as a whole. Yet all information is somehow related to all other information. So it stands to reason that a major concern of distributed system infrastructure is moving data to the machines that need it. This also ends up being one of the most significant challenges.</p>
<h2 id="remote-procedure-calls">Remote procedure calls</h2>
<p>The simplest way to move information from one box to another is through a <strong>remote procedure call</strong> (RPC). An RPC models the way that code calls functions in a program. The caller passes a packet of information to the recipient as parameters. It then waits for the recipient to do whatever it needs to do, even if the recipient is going to call another procedure. And then the recipient returns another packet of information in the results. This model works fine for programs, but it has some drawbacks in distributed systems.</p>
<p>RPCs are <strong>synchronous</strong>. The calling machine allocates some resources, typically a thread, that stand waiting for the recipient to respond. Sometimes the request is a query, where the caller is waiting for the results before it can continue processing. Other times, it is a command, informing the recipient that it needs to take action. Even if the call is a command that returns <code class="highlighter-rouge">void</code>, the caller must wait for the response. If it doesn’t receive the response, it doesn’t know that the intended recipient received the call. So it has to either fail the request or retry the RPC.</p>
<p>RPCs are also <strong>unreliable</strong>. A local method call in a program cannot typically fail to reach the recipient. But a remote procedure call can get dropped, it can time out, or it can be corrupted. This can happen not only to the request, but also to the response. If the call fails, the caller has no way of knowing which was lost. If it was the request, then a retry would be safe. But if it was the response, then a retry might lead to duplication.</p>
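<p>The retry dilemma can be sketched in a few lines of Python. This is an illustration, not the article's design: it shows the common mitigation of tagging each request with an idempotency key so that a retry after a lost response does not duplicate the work. All names are hypothetical.</p>

```python
# Why blindly retrying an RPC can duplicate work: the server may have
# processed the call even though the response was lost in transit.
class Account:
    def __init__(self):
        self.balance = 0
        self.seen = set()  # idempotency keys of already-processed requests

    def deposit(self, amount, request_id):
        if request_id in self.seen:   # duplicate retry: do nothing
            return self.balance
        self.seen.add(request_id)
        self.balance += amount
        return self.balance

account = Account()
account.deposit(100, request_id="req-1")   # response lost on the wire...
account.deposit(100, request_id="req-1")   # ...so the caller retries safely
assert account.balance == 100
```

<p>Without the <code class="highlighter-rouge">seen</code> check, the retry would deposit twice.</p>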
<h2 id="advantages-of-message-queues">Advantages of message queues</h2>
<p>Message queues address both of these problems. First, they are <strong>asynchronous</strong>. After the message has been queued, the caller doesn’t wait for it to be processed. It can free up that thread for handling additional work. If the caller expects a response, then the caller must pull response messages from a second queue. While this complicates matters, we have tools (for example <strong>sagas</strong>) to address this new complexity.</p>
<p>Message queues are also <strong>reliable</strong>. Once the sender is sure that the message is in the queue, it can be confident that it will be received once and only once. Message queues typically have a two-phase API for receipt: first the recipient gets the message, and then it commits. If a problem occurs before the commit phase, then the message is “put back on” the queue (in actual fact, it never truly left; it only looked like it did). This two-phase receipt ensures that a message will be processed once and only once. It is no longer the concern of the sender.</p>
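<p>A rough in-memory model of that two-phase receipt, with hypothetical names (real brokers implement this with visibility timeouts or transactions):</p>

```python
import queue

class TwoPhaseQueue:
    """Sketch of get-then-commit receipt: a message is only removed once
    the handler commits; if the handler fails first, the message is
    released back onto the queue."""

    def __init__(self):
        self._q = queue.Queue()
        self._in_flight = {}

    def send(self, body):
        self._q.put(body)

    def receive(self):
        body = self._q.get_nowait()
        handle = object()              # receipt handle for this delivery
        self._in_flight[handle] = body
        return handle, body

    def commit(self, handle):
        del self._in_flight[handle]    # now the message is truly gone

    def abandon(self, handle):
        # "Put back on" the queue -- in effect, it never truly left.
        self._q.put(self._in_flight.pop(handle))

q = TwoPhaseQueue()
q.send("invoice order 42")
handle, body = q.receive()
q.abandon(handle)            # handler crashed before committing
handle, body = q.receive()   # the same message is redelivered
q.commit(handle)             # processed once and only once
```
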
<h2 id="disadvantages-of-message-queues">Disadvantages of message queues</h2>
<p>Message queues and RPCs do have one feature in common. They are both <strong>one-to-one</strong> in nature: they transmit information from one sender to one recipient. Like a caller of an RPC, the message sender has some idea about the system for which the message is intended. A queue is not a broadcast mechanism. When one recipient receives the message, it is no longer available for others to pick up. We have additional tools (for example <strong>dispatchers</strong>) to support one-to-many scenarios.</p>
<p>Message queues create additional operational complexity. Every queue must be created, configured, and monitored. Every sender and recipient must be configured with the location and name of each queue. We have created tools (for example <a href="/distributed-systems/service-bus.html"><strong>service busses</strong></a>) to manage this complexity, but they do not eliminate it entirely.</p>
<h2 id="message-queuing-in-historical-modeling">Message queuing in historical modeling</h2>
<p>Historical modeling puts the idea of the message queue into the model itself. Every fact is potentially a queue. A subsequent fact can be published to this predecessor.</p>
<p>Take, for example, a medical claims processing service.</p>
<p><img src="/images/claim.gif" alt="Claim fact" /></p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact Physician {
    unique;
}
fact Patient {
    unique;
}
fact Visit {
    Physician physician;
    Patient patient;
    date dateOfService;
}
fact Payer {
    unique;
}
fact Claim {
    publish Payer payer;
    Visit visit;
}
</code></pre>
</div>
<p>In this model, the Payer fact acts as a queue. A Claim is published to that queue. This is indicated in the factual code with the <code class="highlighter-rouge">publish</code> keyword, and in the diagram with a red arrow.</p>
<p>When the payer processes the claim, it responds with a remittance advice.</p>
<p><img src="/images/remittance_advice.gif" alt="Remittance Advice fact" /></p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact RemittanceAdvice {
    publish Claim claim;
    decimal amount;
}
</code></pre>
</div>
<p>The RemittanceAdvice fact is published to the Claim. The practice subscribes to the claim in order to receive the response.</p>
<p>To make the Payer act as a queue, it needs to query for all of the unprocessed claims:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact Payer {
    unique;
    Claim* unprocessedClaims {
        Claim c : c.payer = this
        where not c.processed
    }
}
</code></pre>
</div>
<p>The query depends upon the <code class="highlighter-rouge">processed</code> predicate:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact Claim {
    publish Payer payer;
    Visit visit;
    bool processed {
        exists RemittanceAdvice a : a.claim = this
    }
}
</code></pre>
</div>
<p>Adding a RemittanceAdvice causes the <code class="highlighter-rouge">processed</code> predicate to become true, thus removing the Claim from <code class="highlighter-rouge">unprocessedClaims</code>.</p>
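<p>The same query mechanics can be sketched in ordinary Python. The factual model above is the authority; this is only an illustration of how a negative-existence predicate makes a query behave like a queue. The claim and payer values are made up.</p>

```python
# A Claim is "processed" when a RemittanceAdvice referencing it exists,
# so adding the advice removes the claim from unprocessedClaims.
claims = [{"id": 1, "payer": "Acme"}, {"id": 2, "payer": "Acme"}]
advices = []

def processed(claim):
    # Mirrors: exists RemittanceAdvice a : a.claim = this
    return any(a["claim"] == claim["id"] for a in advices)

def unprocessed_claims(payer):
    # Mirrors: Claim c : c.payer = this where not c.processed
    return [c for c in claims if c["payer"] == payer and not processed(c)]

assert len(unprocessed_claims("Acme")) == 2
advices.append({"claim": 1, "amount": 125.00})   # respond to claim 1
assert [c["id"] for c in unprocessed_claims("Acme")] == [2]
```
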
<h2 id="advantages-of-historical-message-queues">Advantages of historical message queues</h2>
<p>The historical modeling message queue pattern has some advantages over traditional message queues. Most significantly, the queue is no longer coupled to a physical location. The practice didn’t know the location or name of the payer’s queue. Neither did the payer know the location of the practice. As long as the sender and recipient share a common upstream server, the claims and remittance advice will flow to the interested parties.</p>
<p>We also have the advantage of creating queues on the fly with no operational overhead. We can add a new payer to the system as easily as creating a new object. Each payer’s service subscribes to its own queue, without the need for configuration. And consider the operational nightmare of configuring a new response queue per practice, let alone per claim as we’ve done in this model.</p>
<h2 id="disadvantages-of-historical-message-queues">Disadvantages of historical message queues</h2>
<p>On the other hand, historical modeling has one significant disadvantage as compared to message queues: it is impossible to ensure that only one recipient handles each message. The two-phase receipt of a traditional message queue lets one service lock a message. Other services pulling work from the same queue will not receive it unless the first service fails. This is an effective technique for load-balancing backend services.</p>
<p>The rules of historical modeling forbid locking. To balance the load among competing backend systems, you must bridge a historical model into a more traditional message queue. The bridge pushes a message onto the queue for each unprocessed fact, and then creates a new fact marking it as received. This bookkeeping fact is not intended for application use, and is typically not published. The historical database and the message queue must be compatible so that they can participate in the same transaction, thus ensuring reliability.</p>
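<p>A simplified sketch of that bridging idea, with hypothetical names. In a real system the enqueue and the bookkeeping fact would share a single transaction, as the paragraph above requires; here they are just sequential statements.</p>

```python
import queue

# For each fact not yet bridged, push a message onto a conventional
# queue and record a bookkeeping entry so the fact is bridged only once.
facts = [{"type": "Order", "id": 1}, {"type": "Order", "id": 2}]
bridged = set()             # stands in for the unpublished bookkeeping facts
work_queue = queue.Queue()  # traditional queue; supports locking/load balancing

def bridge():
    for fact in facts:
        if fact["id"] not in bridged:
            work_queue.put(fact)     # hand the fact to competing consumers
            bridged.add(fact["id"])  # mark it as received

bridge()
bridge()  # idempotent: already-bridged facts are not re-queued
assert work_queue.qsize() == 2
```
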
<h2 id="conclusion">Conclusion</h2>
<p>Message queues offer significant advantages over RPCs in distributed systems. Whereas RPCs are synchronous and unreliable, message queues are asynchronous and reliable. They add complexity both to application design and operations, but that complexity can be managed.</p>
<p>Historical modeling supports the concept of a message queue through the <code class="highlighter-rouge">publish</code> keyword, predicates, and queries. It can ease some of the operational complexity, since it decouples senders and recipients from queue location. And since any fact can be a queue, operational overhead is not incurred to add a new queue. However, historical modeling does not support locking, so additional work is required to implement a load-balancing scenario.</p>
Michael L Perry

Distributed Systems
2011-01-05T18:00:00+00:00
http://historicalmodeling.com/distributed-systems

<p>Distributed systems are enterprise solutions to large-scale business problems running across multiple machines. They typically have several stakeholders, each from a different division of an organization. Businesses run on distributed systems, and when the system fails, the business suffers.</p>
<p>The challenges in maintaining distributed systems are not just technical. They span several disciplines, including:</p>
<ul>
<li>Project management</li>
<li>Business analysis</li>
<li>Development</li>
<li>Configuration management</li>
<li>Operations</li>
<li>Database administration</li>
<li>Business intelligence</li>
</ul>
<p>Major business problems include:</p>
<ul>
<li>Identifying the important metrics and indicators to make business decisions</li>
<li>Aligning workflows with business processes</li>
<li>Allocating ownership and funding for system development and maintenance</li>
</ul>
<p>Major technical problems include:</p>
<ul>
<li>Getting relevant data to the appropriate system</li>
<li>Reducing latency</li>
<li>Guaranteeing business continuity in the face of technical outages</li>
<li>Ensuring that no data or transaction is lost</li>
</ul>
<p>Historical Modeling alone does not address all of these concerns. Instead, it works within a framework of thought developed by the brightest minds of the software industry, supported by past experience and ongoing research. This series of articles explores that framework, and the role that Historical Modeling plays within it.</p>
<h2 id="mainstream-guidance-and-tools">Mainstream guidance and tools</h2>
<p>There are large gaps between expert thinking and common practice regarding distributed systems. Mainstream vendors like Microsoft, Oracle, and Force.com provide tools and platforms for a large set of solutions. Those tools and platforms are not intended specifically for distributed systems. When they are inappropriately applied, they often fail. Common failure scenarios include:</p>
<ul>
<li>Lost or duplicated transactions</li>
<li>Inability to scale</li>
<li>Fragility in configuration and operation</li>
<li>Slow or unreliable reporting</li>
<li>Misalignment of technical dependencies with business priorities</li>
</ul>
<p>Mainstream guidance, inappropriately applied, is usually to blame for these failures.</p>
<p>For example, tools like Web Services and Windows Communication Foundation (WCF) lead us into building distributed systems exclusively on <strong>remote procedure calls</strong> (RPCs). RPCs are point-to-point: the caller knows about one specific recipient. RPCs are synchronous: the caller waits for the recipient to respond. RPCs are unreliable: if the call fails, the caller does not know whether the recipient has received the message. All of these factors conspire to cause lost transactions and fragile systems.</p>
<p>Additionally, tools like relational databases lead us to create a <strong>system of record</strong> (SOR). The SOR is the authority for all information pertaining to a specific topic. Having an SOR for each domain encourages us to ask the SOR every time we need information about that domain. This puts unnecessary load on the system, making it difficult to scale. Furthermore, when the SOR is unavailable, all downstream business is affected. When an important business system depends upon a less important system of record, technology is misaligned with business.</p>
<p>Finally, conventional enterprise development has taught us to create an <strong>enterprise data model</strong> (EDM). All of the data needed to run a business is stored in one place, normalized and indexed in one way, and completely interrelated. Updating an EDM in response to an application action sometimes imposes locks on several different tables to guarantee consistency. Reporting against an EDM requires that we join across many tables and run aggregate functions to get the necessary data for decision making. Taken together, this affects the scalability of our system, and the speed of our reports.</p>
<h2 id="distributed-systems-theory-and-practice">Distributed systems theory and practice</h2>
<p>None of these mainstream recommendations is incorrect in its own right. They simply cannot be broadly applied, particularly within a distributed system. Industry experts have known about the problems that inappropriate application causes. They have offered several solutions.</p>
<p>Rather than building distributed systems exclusively with RPCs, experts advise us to use message queues where appropriate. A well-placed message queue breaks the point-to-point coupling between components, leading to less fragile systems and better business alignment. It also creates reliable, transactional intermediate storage so that no messages are lost and the system is generally more reliable.</p>
<p>Instead of directly querying the system of record, experts advise that we should separate queries from commands. <strong>Command query responsibility segregation</strong> (CQRS) is the practice of creating two data stores per domain, one optimized for reads and the other optimized for writes. A background process moves data from the write side to the read side according to a service level agreement (SLA). This allows the system to scale, and still provide quick and reliable reporting.</p>
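<p>A minimal CQRS sketch (illustrative only, with invented names): commands append to a write store, and a background projection step builds the read model that queries actually hit.</p>

```python
# Write side: append-only, optimized for accepting commands.
write_store = []
# Read side: denormalized, optimized for queries.
read_model = {}

def handle_command(product, qty):
    write_store.append({"product": product, "qty": qty})

def project():
    # The "background process" that moves data from the write side
    # to the read side, within some service level agreement.
    read_model.clear()
    for entry in write_store:
        read_model[entry["product"]] = read_model.get(entry["product"], 0) + entry["qty"]

handle_command("widget", 3)
handle_command("widget", 2)
project()
assert read_model["widget"] == 5   # queries never touch the write store
```

<p>The read model lags the write store by the projection interval, which is where the SLA mentioned above comes in.</p>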
<p>Finally, rather than always updating state during a transaction, experts sometimes recommend <strong>event sourcing</strong>. This is the practice of recording an event stream, and using that stream as the source of knowledge. The event stream serves as an audit log, revealing every business operation that has occurred within the system. Furthermore, it serves as an authority. Any view of the system can be recreated by replaying the event stream. Correctly applied, event sourcing leads to more scalable and reliable systems that never lose or duplicate transactions.</p>
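<p>The "replay the stream to recreate any view" idea can be shown in a few lines of Python. This is a generic event-sourcing sketch with made-up event names, not code from the article:</p>

```python
# The event stream is the authority; the balance is a derived view.
events = [
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
    {"type": "Deposited", "amount": 10},
]

def replay(stream):
    # Rebuild the view from scratch by folding over the events.
    balance = 0
    for e in stream:
        if e["type"] == "Deposited":
            balance += e["amount"]
        elif e["type"] == "Withdrawn":
            balance -= e["amount"]
    return balance

assert replay(events) == 80   # the view is computed, never stored
```
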
<h2 id="continue-reading">Continue reading</h2>
<ul>
<li><a href="/distributed-systems/message-queues.html">Message Queues</a></li>
<li><a href="/distributed-systems/service-bus.html">Service Bus</a></li>
<li><a href="/distributed-systems/guarantees.html">Guarantees</a></li>
<li><a href="/distributed-systems/cap-theorem.html">The CAP Theorem</a></li>
</ul>
<h2 id="resources">Resources</h2>
<p>For more information on distributed systems and the experts who have influenced Historical Modeling, please see the following:</p>
<ul>
<li><a href="http://www.udidahan.com/2009/12/09/clarified-cqrs/">Udi Dahan – Clarified CQRS</a></li>
<li><a href="http://codebetter.com/gregyoung/2010/02/13/cqrs-and-event-sourcing/">Greg Young – CQRS and Event Sourcing</a></li>
<li><a href="http://www.allthingsdistributed.com/2008/12/eventually_consistent.html">Werner Vogels – Eventual Consistency</a></li>
<li><a href="https://en.wikipedia.org/wiki/CAP_theorem">Eric Brewer – The CAP Theorem</a></li>
<li><a href="https://www.infoq.com/interviews/eric-evans-ddd-interview">Eric Evans – Domain Driven Design</a></li>
<li><a href="http://martinfowler.com/eaaDev/EventSourcing.html">Martin Fowler – Event Sourcing</a></li>
</ul>
Michael L Perry

Unique Identifiers
2009-10-23T18:00:00+00:00
http://historicalmodeling.com/examples/work-item-tracker/unique_identifiers

<p>Work items need unique identifiers within a project. These aren’t required by the model, since the model uses the “unique” field. Instead, these are required by people talking about and exchanging email regarding work items. They are also used to reference work items in external systems that are not connected to the historical model. This unique identifier has to be human readable.</p>
<p><img src="/images/uniqueidentifiers.jpg" alt="Unique Identifiers" /></p>
<p>Any client can create a work item without deferring to a central authority. The client cannot assign a unique identifier to the work item. Until the client synchronizes, the work item remains anonymous. After it synchronizes, a centralized work item identification service can assign it a unique identifier.</p>
<h2 id="work-item-identification-service">Work item identification service</h2>
<p>We install a centralized service running on one machine. This service claims a project for which it will generate identifiers. By mutual agreement, no other service can claim that project. This is not enforced by the model, just by convention.</p>
<p>The work item identification service has a proxy fact that exists in the historical model. From this proxy fact, it can query for work items that need identifiers.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact WorkItemIdentificationService
{
    Project project;
    // Find the work items that need identifiers.
    WorkItem* unidentifiedWorkItems
    {
        WorkItem wi : wi.project = this.project
        where not exists wi.identifier
    }
}
</code></pre>
</div>
<p>For the first unidentified work item that it finds, the identification service creates a unique identifier.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact Identifier
{
    Project project;
    string identifier;
    // Get the work item for this unique identifier.
    // This will be assigned by an identification service.
    WorkItem* workItem
    {
        WorkItemIdentifier wiid : wiid.identifier = this
        WorkItem wi : wiid.workItem = wi
    }
}
</code></pre>
</div>
<p>It then assigns this identifier to the work item via an associative WorkItemIdentifier fact.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact WorkItemIdentifier
{
    WorkItem workItem;
    Identifier identifier;
}
</code></pre>
</div>
<p>The work item can then query for its identifier through this association.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact WorkItem
{
    unique;
    Project project;
    property string description;
    // Find the unique identifier of this work item.
    // This will be assigned by an identification service.
    Identifier* identifier
    {
        WorkItemIdentifier wiid : wiid.workItem = this
        Identifier id : wiid.identifier = id
    }
    // ...
}
</code></pre>
</div>
<p>When the work item is first created, it has not yet been given a unique identifier. The <em>identifier</em> query will be empty, so the work item will appear in <em>unidentifiedWorkItems</em>. After the identification service has given it an identifier, the <em>identifier</em> query will no longer be empty. At that point the work item will no longer appear in <em>unidentifiedWorkItems</em>. The query acts as a queue.</p>
Michael L Perry

Ownership
2009-10-12T18:00:00+00:00
http://historicalmodeling.com/examples/work-item-tracker/ownership

<p>Whereas a developer can be a member of several projects over time, a work item belongs to only one project. It is part of that project when it is created, it cannot be part of multiple projects, and it cannot be moved to another project. This is strict ownership.</p>
<p>Since all fields in a historical fact are immutable, ownership is simply represented as a field referencing the owner.</p>
<p><img src="/images/ownership.jpg" alt="Ownership" /></p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact WorkItem
{
    unique;
    Project project;
    property string description;
}
</code></pre>
</div>
<p>Besides the project, there are no other immutable attributes of a work item. Its description could change. Its type (defect, enhancement, user story, etc.) could change. It could be assigned to different developers over time. So we need the “unique” field to differentiate between work items within a project.</p>
<p>An owner usually knows about its children. This is accomplished with a query:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact Project
{
    unique;
    property string name;
    WorkItem* workItems
    {
        WorkItem wi : wi.project = this
    }
    Developer* developers
    {
        ProjectMembership m : m.project = this
        Developer d : m.developer = d
    }
}
</code></pre>
</div>
<p>In this case, however, a query listing all of the work items in a project is probably not useful. There are better ways to organize things.</p>
Michael L Perry

Audit Trail
2009-10-12T18:00:00+00:00
http://historicalmodeling.com/examples/work-item-tracker/audit_trial

<p>To track assignments of work items to developers, we’ll create a folder concept.</p>
<p><img src="/images/audit_trail.jpg" alt="Audit Trail" /></p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact Folder
{
    unique;
    ProjectMembership membership;
    property string name;
}
</code></pre>
</div>
<p>Project membership is a prerequisite for creating a folder. A developer can create as many folders per project as he needs. A work item can be assigned to a folder, which implies that it is assigned to the associated developer.</p>
<p>The simplest way to assign a work item to a folder would be to create a property.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact WorkItem
{
    unique;
    Project project;
    property string description;
    property Folder assignedTo;
}
</code></pre>
</div>
<p>A work item is assigned to a folder. That assignment can be changed at any time.</p>
<p>The problem with this solution is that it hides the history of folder assignments. An important part of tracking work items is to see their progress as they move from folder to folder. To make this more explicit, we’ll express the full fact structure rather than using the “property” shorthand.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact Assignment
{
    WorkItem workItem;
    Folder assignedTo;
    Assignment* prior;
    bool current
    {
        not exists Assignment next : next.prior = this
    }
}
</code></pre>
</div>
<p>This gives us the ability to explicitly query for either the current folder or the history of all assignments:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact WorkItem
{
    unique;
    Project project;
    property string description;
    Folder* assignedTo
    {
        Assignment a : a.workItem = this where a.current
        Folder f : a.assignedTo = f
    }
    Assignment* assignments
    {
        Assignment a : a.workItem = this
    }
}
</code></pre>
</div>
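<p>The assignment chain above can be sketched in plain Python to show why the explicit structure preserves history: an assignment is "current" exactly when no later assignment names it as its prior. The work item ids and folder names below are invented for illustration.</p>

```python
# Each assignment records which assignment it superseded.
assignments = [
    {"id": 1, "work_item": "WI-1", "folder": "Alice", "prior": None},
    {"id": 2, "work_item": "WI-1", "folder": "Bob", "prior": 1},
]

def current(assignment):
    # Mirrors: not exists Assignment next : next.prior = this
    return not any(a["prior"] == assignment["id"] for a in assignments)

def assigned_to(work_item):
    # Mirrors the assignedTo query: only current assignments count.
    return [a["folder"] for a in assignments
            if a["work_item"] == work_item and current(a)]

assert assigned_to("WI-1") == ["Bob"]   # only the latest assignment is current
assert len(assignments) == 2            # yet the full history remains queryable
```
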
<p>This also gives us a place to hang a note. A developer optionally enters a note when making the assignment, or any time while working on an assignment.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>fact Note
{
    Developer by;
    Assignment assignment;
    string text;
}
fact Assignment
{
    WorkItem workItem;
    Folder assignedTo;
    Assignment* prior;
    bool current
    {
        not exists Assignment next : next.prior = this
    }
    Note* notes
    {
        Note n : n.assignment = this
    }
}
</code></pre>
</div>
<p>Properties hide the audit trail from the application. But by making properties explicit, we can both query and annotate the audit trail.</p>
Michael L Perry