When a system outage interrupts the processing of events by a StreamInsight application, you typically have requirements for the quality and timeliness of the application's output after recovery.
You want the content of output streams to match what it would have been if no outage had occurred.
You want the duration of the outage to be as brief as possible.
The Premium edition of StreamInsight provides a checkpointing feature that can periodically save the state of queries to disk. You can use this feature along with related features of input and output adapters to achieve equivalence of the output stream after recovery from an outage. Since properly written input adapters replay only the missed events since the last checkpoint that was captured, the amount of time spent in recovery is kept to a minimum.
A StreamInsight checkpoint operation persists the state of a query to disk in a consistent manner. After an outage, the query can then be restored to its state as of the checkpoint.
Checkpointing alone does not guarantee that the stream of events produced by a query in the absence of an outage will be equivalent to the stream of events produced after an outage occurs. Two problems can affect the equivalence:
Events can be missed. The events received by StreamInsight after the checkpoint, and the events that occurred between an outage and recovery, are not captured by a checkpoint. These events must be presented to the server again to be included in query output. Solving this problem requires the participation of input adapters that are capable of replaying the missed events.
Events can be duplicated. The events produced by StreamInsight after the last checkpoint before the outage will be produced again during recovery from the outage when input adapters replay events as expected. Solving this problem requires the participation of output adapters that are capable of eliminating these duplicate events.
The checkpoint log is the set of files that contain the persisted checkpoint information. The log is saved to a directory that you specify when you configure the server for resiliency. This directory should be reserved for use only by StreamInsight and should be treated as opaque.
There are three levels of resiliency that you can achieve with StreamInsight. Selecting a level depends on your requirements and your ability to change existing applications and adapters.
State retention. You can use checkpoints to save the state of queries without making any changes to input or output adapters. This level of resiliency does not guarantee that the resulting stream after recovery from an outage is equivalent to the stream if no outage had occurred, because events that occurred after the last checkpoint was captured have been lost. However, this may be acceptable in situations where equivalent results are not needed, and where approximately correct output can be achieved with partial input.
Complete output. You can guarantee that no events will be missed by changing input adapters to replay events. The output stream from a recovered query will be logically equivalent to a superset of the output stream from an uninterrupted query, and the additional events will be duplicates of events in the uninterrupted stream.
Equivalent output. You can guarantee logically equivalent output by changing input adapters and also changing output adapters to eliminate duplicate events.
The high-water mark is the highest application time seen up to a specific point in the stream of events. When a checkpoint is requested, StreamInsight captures a checkpoint as of a high-water mark on each of the inputs.
To understand the prerequisites for output that is complete and equivalent after recovery from an outage, it is useful to recognize the events and state that cannot be saved by StreamInsight checkpointing. These events and this state must be persisted separately to be available after recovery from an outage.
Events or state that cannot be saved by checkpointing
Events that arrive after the last checkpoint and before the outage
To be available for replay after recovery from an outage, these events must be persisted in a data store.
Events that arrive during the outage
To be available after recovery from an outage, these events must be persisted in a data store.
Knowledge of events that were produced as output after the last checkpoint and before the outage
To support the removal of duplicate events by output adapters after recovery, these events must be persisted in a data store.
Any state that is maintained by custom input or output adapters
To be available after recovery from an outage, this state must be persisted in a data store by the custom input or output adapters.
When a StreamInsight application restarts after an outage, the call to the Create method of the input adapter factory provides the high-water mark to the adapter factory. (The adapter factory must implement the IHighWaterMarkInputAdapterFactory or IHighWaterMarkTypedInputAdapterFactory interface in order to receive this information.) The input adapter is expected to replay its input stream from the high-water mark.
Correct replay by all input adapters guarantees output that is complete.
Therefore output that is complete has the following requirements:
An input adapter factory that implements the IHighWaterMarkInputAdapterFactory or IHighWaterMarkTypedInputAdapterFactory interface.
The availability after recovery of all the events that occurred after the last checkpoint that was captured before the outage.
The availability after recovery of all the events that occurred during the outage.
The correct replay of these events by all input adapters.
Checkpointing and the recovery of query state.
To identify the location of the checkpoint in the output stream, the call to the Create method of the output adapter factory provides both the high-water mark and an offset from this high-water mark. (The adapter factory must implement the IHighWaterMarkOutputAdapterFactory or IHighWaterMarkTypedOutputAdapterFactory interface in order to receive this information.) This offset is necessary because the location in the output stream corresponding to the checkpoint may fall at any point in the stream.
If a query is properly replayed, then the internal query state will be as of the last checkpoint, and any events that were produced after the last checkpoint was taken will be produced upon restart. This means that any events that were produced as output after the last checkpoint but before the outage will be produced a second time during recovery. These are the duplicates that the output adapter must remove. How these are removed is up to the output adapter; for example, the duplicate copies can be ignored.
Correct elimination of duplicates by all output adapters (after correct replay by all input adapters) guarantees output that is equivalent.
Therefore output that is equivalent has the following requirements in addition to the requirements listed previously for complete output:
An output adapter factory that implements the IHighWaterMarkOutputAdapterFactory or IHighWaterMarkTypedOutputAdapterFactory interface.
The availability after recovery of all the events that occurred after the last checkpoint that was captured before the outage. (This location in the stream is identified by the high-water mark and offset that are provided to the output adapter factory when it is created.)
The correct removal of duplicate events by all output adapters.
For more information about building, monitoring, and troubleshooting resilient StreamInsight applications, see the following topics:
For an end-to-end code sample of a resilient application that includes replay and deduplication, see the Checkpointing Sample on the StreamInsight Samples page on Codeplex.