{"id":7837,"date":"2019-01-30T12:15:31","date_gmt":"2019-01-30T12:15:31","guid":{"rendered":"https:\/\/support.loginextsolutions.com\/?p=7837"},"modified":"2026-01-28T09:43:18","modified_gmt":"2026-01-28T09:43:18","slug":"fault-failure","status":"publish","type":"post","link":"https:\/\/support.loginextsolutions.com\/index.php\/2019\/01\/30\/fault-failure\/","title":{"rendered":"Failure Detection and Fault Tolerance at LogiNext"},"content":{"rendered":"\n\n\n\t<div class=\"dkpdf-button-container\" style=\" text-align:left \">\n\n\t\t<a class=\"dkpdf-button\" href=\"\/index.php\/wp-json\/wp\/v2\/posts\/7837?pdf=7837\" target=\"_blank\"><span class=\"dkpdf-button-icon\"><i class=\"fa fa-file-pdf-o\"><\/i><\/span> Download PDF<\/a>\n\n\t<\/div>\n\n\n\n\n\n<p>LogiNext\u2019s production environment is robust and dependable. The implementation of the Fault Tolerance techniques is based on a circular chain reaction of Failure, Error and Fault.<\/p>\n<p>A failure is said to occur in a system when the system\u2019s environment observes an output from the system that does not conform to its specification. An error is the part of the system, e.g. one of its constituent (sub)systems, which is liable to lead to a failure. A fault is the adjudged cause of an error and may itself be the result of a failure. Hence, a fault causes an error that produces a failure, which may in turn result in a fault, and so on.<\/p>\n<p>All failures are assessed and handled so that the system reacts gracefully to any unexpected equipment or programming malfunction, using established techniques to confront faults and their consequences. This is achieved by:<\/p>\n<p>1. Detecting errors in the system &#8211; The constituents of LogiNext\u2019s environment monitor one another for failure occurrences. By observing a failure, the monitoring subsystem detects an error on the monitored subsystem.<\/p>\n<p>2. 
Restoring or recovering the subsystem on which the error was detected before it affects other parts of the system &#8211; This is achieved through a technique called checkpointing. To enable the restoration of a subsystem after an error has been detected on it, a complete snapshot of the subsystem\u2019s internal representation (i.e. the state of the subsystem) is saved at regular intervals. When a monitoring subsystem observes a failure on a monitored subsystem, it activates a mechanism that uses the last checkpoint of the latter subsystem to eliminate the error that led to the observed failure and restore the subsystem to an error-free state.<\/p>\n<p>3. Masking the error occurrence by isolating the subsystem on which the error was detected and using some form of redundancy to deliver the expected output &#8211; When a monitoring subsystem observes a failure on a monitored subsystem, it prevents the erroneous behaviour of the latter subsystem from affecting other parts of the overall system by spawning a duplicate of the failed subsystem, a form of redundancy that covers for the observed failure.<\/p>\n<p>The choice of fault tolerance mechanism is made by the LogiNext architecture based on a number of factors, including the simultaneous errors that may occur; the design, space and time complexity of the fault tolerance mechanism; and how these align with the requirements for the corresponding system qualities. However, the failure types that a system will confront play the primary and decisive role in selecting the fault tolerance techniques that can be applied to render the system fault tolerant. 
LogiNext develops and deploys different fault tolerance techniques to deal with different failure types, and these techniques may differ in all three means: error detection, recovery and masking.<\/p>\n<p>LogiNext\u2019s fault tolerance techniques are implemented with the following metrics in mind:<\/p>\n<ul>\n<li><strong>Throughput: <\/strong>This parameter measures the number of tasks that are executed to completion. The system is designed for high throughput.<\/li>\n<li><strong>Response Time: <\/strong>The time the system takes to respond is kept to a minimum.<\/li>\n<li><strong>Scalability: <\/strong>The fault tolerance capacity of the system is designed to be independent of the number of nodes in the system.<\/li>\n<li><strong>Performance: <\/strong>This parameter checks the effectiveness of the system. Performance has to be improved at a sensible cost, e.g. response time can be reduced by allowing acceptable delays.<\/li>\n<li><strong>Availability:<\/strong> The fault tolerance technique is devised to ensure that the system is functioning at any given instant in time and under all defined circumstances. 
The principle that the availability of a system is directly proportional to its reliability is always respected.<\/li>\n<li><strong>Usability:<\/strong> The fault tolerance technique is implemented so that it does not limit the extent to which a product can be used by our customers to achieve their goals with effectiveness, efficiency, and satisfaction.<\/li>\n<li><strong>Reliability:<\/strong> In a time-bounded environment, the technique is devised to give a correct or acceptable result.<\/li>\n<li><strong>Overheads:<\/strong> The overheads imposed by task movements and by inter-process or inter-processor communication are kept to a minimum so that the fault tolerance technique remains efficient.<\/li>\n<li><strong>Cost Effectiveness:<\/strong> Here, cost is defined only as monetary cost.<\/li>\n<\/ul>\n<p>LogiNext uses the Heartbeat strategy, also known as the \u201cI am Alive\u201d technique, for failure detection. Heartbeat is a widely implemented strategy for failure detectors.<\/p>\n<p>In this technique, a process, say \u201cX\u201d, sends an \u201cI am Alive\u201d message to another process, say \u201cY\u201d, at fixed intervals. \u201cY\u201d waits for the message from \u201cX\u201d until the timeout for \u201cX\u201d expires; if the message has not been received by then, \u201cY\u201d adds \u201cX\u201d to its list of suspected processes. If \u201cY\u201d later receives an \u201cI am Alive\u201d message from \u201cX\u201d, it removes \u201cX\u201d from the list of suspected processes.<\/p>\n<p>In this context, the I Am Alive pattern has the following benefits:<\/p>\n<ul>\n<li>An error on the monitored system is detected as early as possible, even before the environment communicates with the monitored system. 
It introduces low time overhead.<\/li>\n<li>The monitoring system is regularly updated about the occurrence of errors on the monitored system.<\/li>\n<li>The detection time is independent of the last heartbeat message, which increases the accuracy of the failure detector because it avoids premature timeouts.<\/li>\n<li>The design complexity introduced by the I Am Alive pattern is low.<\/li>\n<li>The I Am Alive pattern does not introduce any space overhead. Neither the timer (which counts down from the timeout continuously until it receives an &#8220;I am alive&#8221; signal or until the timeout expires) nor the beacon (which sends &#8220;I am alive&#8221; signals at regular time intervals) maps to an additional architectural or software component; rather, they describe additional functionality associated with the monitoring and the monitored system respectively.<\/li>\n<li>The I Am Alive pattern detects errors on the monitored system on a regular basis, even during long idle communication periods.<\/li>\n<\/ul>\n<p><b>Some critical failure types:<\/b><\/p>\n<p>1. Fail-stop failures, where the failed system ceases execution without producing any output and the failure is detectable by its environment.<\/p>\n<p>2. Crash failures, where the failed subsystem ceases execution without producing any output but the failure might not be detectable by its environment.<\/p>\n<p>3. Omission failures, where a subsystem fails to deliver output to (send omission), or receive input from (receive omission), its environment.<\/p>\n<p>4. Byzantine failures, where the failed subsystem exhibits arbitrary behaviour.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Download PDF LogiNext\u2019s production environment is robust and dependable. 
The implementation of the Fault Tolerance techniques is based on a circular chain reaction of Failure, Error and Fault. A failure is said to occur in a system when the system\u2019s &hellip; <a href=\"https:\/\/support.loginextsolutions.com\/index.php\/2019\/01\/30\/fault-failure\/\">Continued<\/a><\/p>\n","protected":false},"author":8,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[],"tags":[1141,3016,3015,3014,3013,3011,1146,1145,1144,1143,1142,1016,1140,1139,1138,1137,1136,1135,1113,1104,1019],"_links":{"self":[{"href":"https:\/\/support.loginextsolutions.com\/index.php\/wp-json\/wp\/v2\/posts\/7837"}],"collection":[{"href":"https:\/\/support.loginextsolutions.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/support.loginextsolutions.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/support.loginextsolutions.com\/index.php\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/support.loginextsolutions.com\/index.php\/wp-json\/wp\/v2\/comments?post=7837"}],"version-history":[{"count":4,"href":"https:\/\/support.loginextsolutions.com\/index.php\/wp-json\/wp\/v2\/posts\/7837\/revisions"}],"predecessor-version":[{"id":28087,"href":"https:\/\/support.loginextsolutions.com\/index.php\/wp-json\/wp\/v2\/posts\/7837\/revisions\/28087"}],"wp:attachment":[{"href":"https:\/\/support.loginextsolutions.com\/index.php\/wp-json\/wp\/v2\/media?parent=7837"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/support.loginextsolutions.com\/index.php\/wp-json\/wp\/v2\/categories?post=7837"},{"taxonomy":"post_tag","embeddable":true,"href":"htt
ps:\/\/support.loginextsolutions.com\/index.php\/wp-json\/wp\/v2\/tags?post=7837"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}