Dependability in Robotics Software
From RoSta
The current document lacks references, because of some convertion problems from Latex but hopefully will be solved soon.
The .pdf version of this document can be found in the references section of the wiki, under the name "Fault Tolerance and Robustness in Robotics: A Survey "
Abstract: In recent years, technologies such as robotic vacuum cleaners, autonomous rehabilitation systems for elderly, manipulators in automative plants and rovers exploring planets have become commonplace. For all these applications to work successfully, the robotic systems used have to provide a high level of dependability. What happens on a technical level when these systems fail to function properly and their failure may cause unpredictable harm to their technical, natural or human environment? What has to be done or is already done in order to prevent these autonomous devices from malfunctioning and to increase their dependability? These are the questions that we are going to take up in this report. We will try to classify the current field of dependable robotics and specifically concentrate on the aspects of robustness and fault tolerance. At the core of this survey on the state of the art in this field of research lies a comprehensive and critical taxonomy of the approaches commonly used.
Contents |
[edit] Introduction
In the last decade the number and the importance of “intelligent” machines in everyday human life has increased significantly. They are involved in various fields, which range from household and entertainment to industry and space applications. Parallel to their proliferation, the quality of the systems used has grown considerably too, especially with regard to autonomy, perception, actuation and the robustness of software and hardware. Notwithstanding these progresses there still remains an almost ironical problem in human-machine interaction. Algirdas Avizienis sums it up as the contemporary paradox view:
Computing systems provide protective infrastructures for critical infrastructures of modern society: electrical power, telecommunications, transportation..., But, These computing systems do not possess a protective infrastructure of their own.
The issue, though, is even more intricate: Devising protective means for autonomous systems against being damaged is without doubt important and necessary. But what is equally important and necessary, is to devise protective means for autonomous systems not to cause damage to their surroundings and themselves. But the growing complexity of hardware and software components being utilized in the development of robotic applications imposes difficulties for achieving satisfactory levels of dependability. With increasing refinement in the development phase and utilization processes, the product is going to be more susceptible to errors, faults and failures, which, in turn, decrease the reliability and availability of the system1. What is called for are technical means to reduce the affect of these problems and, if possible, to reverse them and restore the system to its original state. These means are usually subsumed under the terms fault tolerance, fault prevention, fault removal and fault forecasting, to which we would like to add robustness as the fundamental system property against threats and uncertainities.
The aim of this paper is to present an overview of the contemporary theory and practice in this domain, which lies at the crossing of engineering and computing sciences, to depict, evaluate and, whenever applicable, merge approaches from these fields and to point to future areas of research. We will first provide an outline of concepts and terminology used in dependable computing (Chapter ??), and will then proceed to state of the art theoretical approaches to the problem of dependability (Chapter ??), before looking at their practical applications in a detailed survey and evaluation of specific robotic software integration systems (SIS) (Chapter??). The paper will end with a summary and suggestions for future trends.
[edit] Dependability: Concepts and Terminology
[edit] Taxonomy
Before commencing our analysis and evaluation of the state of the art in approaches to robotic fault tolerance and robustness, we want to elaborate on fundamental concepts and terminologies which are widely used in the related domains of engineering and computer science. Though the field of fault tolerance and dependability as a whole is quite mature, there is no generally accepted nomenclature. In most cases, whenever researchers discuss a particular topic, the terms they utilize are either idiosyncratically or inconsistently used. This lack of a common language clearly hinders progress in the field. For this reason many researchers demand the development of a distinct and exact taxonomy as a solution to this communication problem. A taxonomy like this would provide the basic means to evaluate and compare different approaches. Several attempts have been made, some being still work in progress, to lay down a taxonomy which can systematically structure the field. Particularly noteworthy initiatives have been launched by the “SAFEPROCESS” Technical Committee in the domain of control engineering the International Foundation for Information Processing (IFIP) Working Group (WG) 10.4 on Dependable Computing and Fault Tolerant Systems in the domain of computer science. The results of these initiatives are taxonomies accepted by their respective research communities. Unfortunately, as already mentioned, the same does not hold true for the robotics domain. Being a multidisciplinary
science, robotics makes it obligatory to take into account ideas suggested by various scientific communities, which might explain the inconsistencies in the use of terminology so far. By bringing together differing nomenclatures from the fields of control engineering and computer sciences, by comparing them with and transferring them to the field of robotics, whenever applicable, we attempt to clarify frequent misunderstandings and contribute to the development of a consistent language, which will also lay the terminological basis for the rest of the paper. As a basis for exploring this issue, we adopted the taxonomies provided by Laprie et al. in [???].
According to Laprie et al. dependability can be defined “as the trustworthiness of a computer system such that reliance can justifiably be placed on the services it delivers”. It is a generic concept which characterizes various facets of a computing system. The composition of these facets is dependent on the specific application. As an example, for a robotic application, which is required to be reliable and licensed under open source, security will not be a top priority for this software system 2. Software for financial transaction systems, though, will require high reliability and security. Additionally, it has to be emphasized, that the various facets of dependability are complementary to each other, i.e. one has to consider and cope with all of them in order to attain high levels of system dependability. Figure [????] depicts these constituents of dependability. As can be seen, there are three main constituents: attributes, threats and means. Robustness is considered to be a secondary attribute in [???], but with regard to the broad usage and importance of this concept in robotics we decided to grant it the same priority as the other three aspects. We investigate each of these concepts in the following subsections.
[edit] Attributes of dependability
Attributes are system properties the levels of which define the dependability of the system. These are:
- Reliability - defined as the ability of a system to provide correct functions under specified conditions during a particular period of time.
- Availability - defined as the ability of a system to provide correct functions under specified conditions at a particular time instant.
- Safety - defined as the ability of a system not to cause damage to its external environment.
- Security - defined as the ability of a system to prevent unauthorized access or handling of system information.
- Maintainability - defined as the ability of a system to undergo correction or modification services.
This list can be prolonged with other attributes, also referred to as “-ilities” like flexibility, portability, testability etc. Since the dependability of a particular system is specific to its application, the list of pertinent attributes is bounded by the needs of that application. It would be naive to set up a list of attributes a robotic system should possess without knowing the peculiarities of the domain it is used in. Still, the attributes listed above are the standard ones commonly present in engineering and computational systems. For more detailed information on dependability attributes we confer to [????].
[edit] Threats to dependability
It is often difficult to reach the system dependability desired. Among the key factors affecting dependability attributes are “threats” to the functionality of the system. Threats can be described as external or internal adverse conditions or events which may cause the system to degrade and may eventually lead to a completely non functional state. They can be classed into three categories, namely faults, errors and failures. These categories of threats are closely related to each other and their relation can be depicted in form of a chain, where each link is a representation of a particular threat (Figure [???]. This chain is referred to as a “fault path”. According to the logic of this diagram faults, errors and failures can be seen as the results of a cause-effect relation. At the same time the diagram can be interpreted as the representation of the direction threats take and consequently the level of system abstractions they should be dealt with at (note that the fault path has a cyclic nature)[????].
Figure 2: A chain representation of the relation between dependability threats - A Fault Path [???]
Although the fault chain depicted communicates the basic knowledge on the difference of the various threats, it does not reveal information on how to cope with them. The first step towards identifying methods for “threat handling” is to develop a taxonomy for each of the afore mentioned threats themselves, because a particular type of fault, error or failure may require different handling methods.
[edit] Faults
There are various definitions in use on what a fault is. Some authors define it as a not permitted deviation of system parameters from the specified values [???], whereas others describe it as the cause of an error [???]. The semantics in most cases reflects the domain the term is used in.
One can identify a vast variety of fault types, and the inventory tends to grow. According to [???], one can differentiate faults according to eight viewpoints:
- Phase of creation - two types of faults can be identified with regard to their time of creation: (a) development phase faults - these are faults committed during the development phase of a system and (b) operational phase faults - these faults affect the system during its operational phase.
- System boundaries - here faults are differentiated according to whether they are internal to the system, (a) internal faults, or come from its external environment, (b) external faults.
- Phenomenological cause - here faults are classified according to their phenomenological nature. There are two types of this class of faults (a) natural faults, i.e. faults which are caused by natural phenomena and (b) human-made faults.
- Dimension - computational systems being composed of software and hardware, two types of faults can be associated with them (a) hardware faults and (b) software faults.
- Objective - this type of faults is usually taken into account when security is concerned. They can be differentiated as (a) malicious and (b) non-malicious faults.
- Intent - similarly to the objective class of faults, the faults in this category also mainly affect the security standards of a system. One can differentiate (a) deliberative faults, which are carried out to cause a harm or damage and (b) non deliberative faults, which are usually introduced by a user or developer without being aware of them.
- Capability - this type of faults is closely related to the human factor, which plays an important role in the development phase of a system. With respect to this, two types of faults are differentiated (a) accidental faults, which are similar in their semantics to non deliberative faults and (b) incompetence faults, which are often caused by a lack in competence of the parties involved.
- Persistence - one of the important and often considered aspects of faults is their temporal attributes. This factor determines how complex the diagnostic approach shall be. There are two kinds of faults in this class (a) transient or also sometimes called intermittent faults . The transient faults are difficult to diagnose because of their temporary nature. (b) Persistent faults are continuously present in a system. This condition simplifies the diagnosis problem.
As is quite obvious, the taxonomy presented above includes a strong overlap among the classes of faults. This situation is further enforced by the fact that usually faults occur as a combination of several different faults. Taking this into account, one can further define fault classes which are orthogonal to the afore mentioned ones. In this approach, faults are grouped into three classes [???].
- Development faults - this group comprises faults which occur during the development phase of a system.
- Physical faults - this group includes all faults related to the hardware of a system.
- Interaction faults - this group includes all faults caused during the interaction of the system with the environment.
Figure [???] depicts the relation of the taxonomies discussed.
Figure 3: The relation between two classifcation views on faults [<A HREF="#59"></A>].
With respect to the domain of robotics, we want to emphasize the work by Chatila et al. The authors provide a fault taxonomy for robotics, particularly on aspects of robotics addressing decisional mechanisms [??], [??], [??]. They define a decisional mechanism as a composition of inference mechanisms and the knowledge these mechanisms work on. Further, these two constituents are separated according to their appearance in the life cycle of an application, the life cycle of a system being differentiated into two phases: the development phase and the operational phase. The resultant taxonomy is depicted in figure [?? ].
It can be seen that some aspects of this approach directly reflect the classification presented in figure [??] with some details specific to robotics. Another approach to the classification of faults in robotics is presented by Murphy et al. Here, the authors put more emphasis on the concept of failures [???]. They also list two key attributes of failures. They are related to the level of the impact of a failure on the robot's mission and to the level of repairability. This approach is depicted in figures [??] and [??] respectively. The authors define terminal failure as a failure terminating the robot's mission and non-terminal failure as a failure introducing a noticeable degradation in the robot's performance to achieve its mission. With respect to repairability of a failure, a failure is called field-repairable if the robot can be repaired using its accompanying repair tools, otherwise it is classified non-field repairable. Although the authors in their classification refer to faults, in our view and when compared with the taxonomies in figure [??] and figure [??], this classification is more concerned with faults rather than with failures.
Figure 6: Attributes associated with failures in field robotics by Murphy et al.
[edit] Errors
The next link in the threat chain representation was given to errors. According to this representation, an error can be described as an effect of a system fault and a cause of a system failure (Figure [??]). As with faults, also in the field of errors a great variety of error types can be differentiated. Examples are single byte vs. double byte errors, when considered from the perspective of control codes, and single errors vs. multiplicative errors with regard to the number of components in which they occur. But being part of a system state, they are usually classified according to the type of failures they may lead to. Only errors which manifest themselves in external system states, i.e. services as perceived by an end user, lead to a system failure, though [???], [???], [???]. We are going to discuss failures shortly, but before we want to mention three key factors which effect the manifestation of an error in a system failure.
- The level of system redundancy - redundancy is a means intended to prevent errors from leading to a system failure. Two types of redundancy can be differentiated (a) intentional redundancy, which is explicitly integrated into the system with the purpose of error prevention and (b) unintentional redundancy also referred to as false redundancy which is often present in system itself without being intentionally included. It may lead to the same effect as intentional redundancy, though.
- The level of system activity or behaviour - an error may be overwritten before causing a system failure or it can never be observed because it lies in the part of a system which is never used.
- A definition of a failure from the perspective of the user - some users may consider specific conditions as failures, but the same conditions may be acceptable to others.
[edit] Failures
A failure is a direct result of an error on external system state. Since the external system state is a part of the service delivered, a failure is a deviation of the end services from the specified ones. A system may fail in different ways depending on the final effect of the three factors indicated above on the external states' errors. The ways the system may fail are referred to as “failure modes”. This concept introduces a relative measure to the level of “impact” caused by a failure. The failure classification is usually performed based on this concept of failure modes. One can differentiate four viewpoints to failure mode classification [???], [??].
- Failure domain - this viewpoint defines two types of failures (a) value failures , where the value of the delivered service is not compliant with the one in the specification and (b) timing failures, which occur when the timing of a system does not comply with the specification.
- Detectability of a failure - this class defines failures according to whether they can be detected or not. Here, one can differentiate (a) signaled failure, in case it is possible to detect a failure and notify, otherwise the failures are (b) unsignaled failures.
- The consistency - according to this viewpoint, failures are classified according to whether they are perceived the same by all the users using the service. Hence, (a) consistent failures, where all users perceive incorrect services in the same way and (b) inconsistent failures , where failure appears differently to users. They are referred to as Byzantine failures.
- The consequences - from this perspective, the impact of a failure on the surrounding is considered. Usually, one can differentiate several levels of severity, but at least two of them are defined: (a) minor or safe failures and (b) dangerous or catastrophic failures. This viewpoint is one of the commonly used approaches to distinguish failure modes in safety engineering.
We summarize our elaboration on threats and their classification by means of a comparison table, figure [????], which provides definitions to the terms discussed from the viewpoint of control engineering and computer science (dependable computing).Here the definitions are based on the nomenclature provided in [???] and [??]. They reflect the viewpoints of the dependable computing community in the view of IFIP WG10.4 and control engineering community in the view of “SAFEPROCESS”.
Figure 7: A comparison table of fault terminology in control engineering and dependable computing
[edit] Means of dependability
With regard to the wide diversity of faults and their effects, what options does a system designer or developer have at hand to improve the dependability attributes of a system? Considerable research has been conducted on this topic and currently one can distinguish the following four methods of addressing the problems resulting from system threats. In the context of dependable computing these four methods are referred to as means to improve the system dependability attributes [figure [??]]. They are: fault tolerance, fault prevention, fault removal and fault forecasting. Further, they can be separated into two groups, namely, dependability procurement and dependability validation [???], [??], [??]. Dependability procurement, which includes fault tolerance and fault prevention, is usually aimed at how to provide a system with the ability to deliver correct services, whereas dependability validation, which includes fault removal and fault forecasting, is an approach to reach confidence in the system's ability to deliver correct services. In their new taxonomy Laprie et al. also differentiate two groups of possible improvements in dependability with respect to the status of threats in a system state and appropriate actions to handle them. Group (a) fault avoidance includes fault prevention and fault removal processes. It is aimed at avoiding the introduction of new faults or removing the ones which were already present in the system. Group (b) fault acceptance, which includes fault tolerance and fault forecasting processes, on the other hand, is directed towards tolerating faults which may occur or predict their possible future manifestation. In our opinion, this view reflects the appearance of dependability means in different phases of an application life cycle.
In the following paragraphs, we attempt to provide a brief introduction to each of the above mentioned means to improve dependability.
- Fault prevention - is usually considered to be a part of the product development phase. It refers to methodologies addressing faults which can occur during the phases of hardware and software design. An example for a method of fault prevention can be the analysis of previous faults in a particular component and accordingly measures to prevent their further occurrence.
-
Fault forecasting - is performed through an evaluation of the system behavior with respect to fault occurrence.
It is used to estimate the present number of faults, their future incidence and their probable causes. The results of forecasting tests are usually expressed in terms of (un)reliability, (un)availability or risk factors. Two approaches to system evaluation can be identified:
- Non-probabilistic or ordinal/qualitative evaluation - this type of evaluation is utilized to classify and rank the failure modes that may lead to a system failure.
- Probabilistic evaluation - in this approach, probabilities are used to express the levels of a particular dependability attribute. These attributes are then used as measures of system dependability.
There is a number of established evaluation techniques available. Some of the well known ones are failure mode and effect analysis (FMEA), failure mode diagnostics and effect analysis (FMDEA), Markov processes and stochastic Petri Nets [??].
- Fault removal - is the means designed to reduce the number and severity of faults in a system. One can differentiate three steps during fault removal processes: (a) verification, which is defined as the process of checking whether the system adheres to system verification conditions. In cases when the results of a system verification do not comply with the ones in the specification, the system undergoes step (b) diagnosis, the intent of which is to detect and identify the faults which cause the failure of the verification process. After faults are diagnosed, the last step (c) correction is undertaken. This includes the necessary correction procedures for removing the effects of faults. The latter two steps are also referred to as a part of the validation process. For further information on verification and validation processes we refer to [???], [???].
- Fault tolerance - is targeted at enabling a system to deliver services
complying with its specification in spite of the presence of system faults. It is
noteworthy that fault tolerance is an inherent part of systems, where the participation
of an external agent in maintenance is hardly affordable. Two methods to perform fault
tolerance are distinguished:
- Error processing - is directed at the removal of errors, before they put a system into failure state. It can be performed in two ways: error recovery and error compensation. Each of these methods also incorporates detection procedures before an error can be processed - hence, the terms error detection and recovery, and error detection and compensation are used. Further elaborating on different error recovery approaches, one can differentiate forward recovery and backward recovery techniques [???].
- Fault treatment - can be viewed as a procedure composed of fault diagnosis and fault deactivation steps. The first step is aimed at detecting and identifying faults and consequently making them to undergo a deactivation process.
Fault tolerance is mainly implemented via redundancy, i.e. introducing additional structures/ sub-systems or structure/ sub-system models which have the same functionality. One of the factors causing the success of the redundancy based approach is the utilization of diverse redundant components. Although diversity imposes its own constraints such as maintenance difficulties, it is usually considered to be a good practice. The reason is that in most cases the same faults have different influences on each component of a redundant system [??]. This ensures that no concurrent failures occur. Figure [???] depicts the relation between fault tolerance techniques (error processing and fault treatment) and redundancy.
As can be seen, fault tolerance can be either based on analytical redundancy (also referred to as model based) or physical redundancy. The idea behind analytical redundancy is to use various mathematical models to achieve the final result, whereas physical redundancy is seen as an implementation of the same functionality on several components. Both of the approaches have their own advantages and disadvantages. For instance, analytical redundancy, sometimes being mathematically challenging and unprecise, offers a cost-effective solution, though. This is contrasted by the physical redundancy approach, which is expensive to implement and hard to maintain but at the same time provides quite reliable results [??], [??]. In this survey, we put emphasis on model-based approaches to fault tolerance, particularly fault diagnosis. The fault diagnosis process is considered as a two step process. It is composed of fault detection and fault identification processes. Figure [???] depicts different methods to fault diagnosis. Although they are called fault detection methods in [??], the big overlap between identification and detection methods urges us to put them into one category, i.e. diagnosis methods. Most of the state of the art approaches to the problem of diagnosis rely on a process model. In this approach, the process behavior is modelled by some well established mathematical techniques, i.e. probabilistic, inductive logic etc. Then the model and the real system are organized according to either reference diagnostics or comparison diagnostic schemes [???]. The output of the diagnostics scheme is a residual whose value carries information on a system state with respect to faults [??]. The details of this approach we will discuss later in state of the art section (Chapter [???]) of our survey.
In addition to the redundancy based approaches to fault tolerance, for the past few years the trend in computational systems, particularly in networked systems, started to move towards the concept of autonomic computing [???]. This concept introduces the idea of self-healing, self-recovering components which are designed to be capable of handling adverse situations by themselves without external supervision [???]. As this field is quite broad on its own, it shall not be discussed in this text. For more information on means to improve dependability we confer to [???], [???].
Another important attribute of dependability, at least from the perspective of robotics, is robustness. The term has been used widely in the scientific community. According to a survey of the Santa Fe Institute 17 different definitions of the word are used in different contexts [??]. In [??], robustness is described as being a specialization of the main attributes of a system. It is defined as the ability of a system to withstand external faults. Chatila et al. [??], [??], [??] relate the concepts of robustness and fault tolerance and transfer them from the computing system domain to the robotics domain. They define robustness as being a superset of fault tolerance and differentiate them on the basis of the faults they address. They define fault tolerance as the method to cope with internal faults related to system resources, whereas robustness is described as a means to handle external faults related to environmental conditions. The authors also suggest a basic taxonomy of methods to robustness in the field of robotics which we broadened with some additional concepts (Figure [??]) [??],[??].
The survey on the state of the art approaches to robustness and fault tolerance in robotics shows that most of the capabilities to handle contingency can be grouped into two classes: (a) low level threat handling and (b) high level threat handling. The low level capabilities are usually inherent in skills and modules which are in most cases responsible for communicating with the hardware and for implementing routine robotic functionalities. On this level, each module or skill can manage its own local conflicts related to its resources or execution processes. On the high level, fault handling is an attribute of the planner, supervisor or executive layers. These entities usually manage the global, long term behavior of a system and are responsible for reasoning on system-environment interaction. We will talk about robustness and fault tolerance capabilities in detail on the example of state of the art robotic software integration systems (SIS) in chapter [???].
This section provided a brief introduction to concepts and terminology, particularly their implications for robotics. Additionally, we described the constituents of a dependable system, i.e. its attributes, threats and means. Definitions and descriptions on the basis of a systematic taxonomy were given for each of them. An emphasis was made on means to improve dependability, specifically fault tolerance and robustness. On the basis of the taxonomies reviewed two new taxonomies on fault tolerance and robustness approaches were created. In the next section the taxonomy on fault tolerance, particularly the diagnosis step will further be elaborated. Additionaly, we will discuss state of the art approaches to fault tolerance in robotics in detail.
[edit] State of the Art
After having explored the abstract level of definitions and terminology relevant to the field, we now proceed to the concrete level of actual techniques for fault diagnosis as an integral step to fault tolerance. This chapter is designed as an attempt to classify most of the frequently used approaches to fault diagnosis in engineering, particularly in control engineering. The classification part is followed by a brief explanation of the theoretical ideas which form the basis of some of these approaches, and their transfer to the field of robotics. In addition, we will try to construct a taxonomy for the diagnosis models which the approaches rely on. Particularly, we will discuss probabilistic models used in robotics and present some contemporary research in this area. Most of the approaches to model-based fault diagnosis in engineering and computer science can be divided into two different classes with respect to the type of models used (Figure [???]). One of the most frequently applied approaches is the process model based diagnosis. It is characterized by the usage of particular process parameters which pertain to the relation between input and output signals. Output observers and parity equations are examples for such approaches to modeling. Sometimes it is not possible to obtain a model of a process explicitly. In such situations, the so called output signal model based diagnosis approach can be used. This allows to approximately infer the relation between input and output signals based on the form of the output signal. Some examples of these modeling types are bandpass filters and spectral analysis [???]. The simple diagram in figure [???], displays an example on how these models are usually integrated into a control system as its diagnostic constituent.
As can be seen in the figure, there are two steps which contribute to the final result of the fault diagnosis procedure. The first step in FDI (Fault Detection and Identification/Isolation) is fault detection. During this, the actual behavior of the system, as being reflected in its output, is compared with the estimated one from the plant model. Together with this plant model, the residual generator calculates the deviation of the actual behavior from the nominal one. It can be a difference of values or any statistical quantity which convey the necessary information on changes in behavior [???]. During the second step (after the residual has been generated), it is evaluated (isolation process) for faults, their probable sources and the decisions to be made [??]. In our view, in robotics it is often difficult to distinguish these two steps. But there are also cases, such as those mentioned by [???], [??], [???], [??] where these two steps can be differentiated explicitly. From the state of the art for fault diagnosis in robotics one can conclude that most of the approaches adopted are process model based and first and foremost relevant to the step of residual generation of the diagnosis. Another important aspect of the methods used in robotics is that they often rely on techniques of either parameter estimation or output observer/output estimation (Figure [??]), [??], [??], [??], [??], [??], [??], [??], [??], [??] and [??]. We attempted to systematically classify some of these commonly used approaches to fault diagnosis (Figure ??). As can be seen in figure [??], model based residual generation, particularly estimation techniques, can be of two types:
- Multiple model based approaches
- Single model based approaches
In section [??] we stated that fault tolerance is mainly implemented via either analytical or physical redundancy (Figure ??). That is why the representation of single model based approaches has rather complementary character and will not be discussed here. We also did not encounter any application based on this approach.
Multiple model techniques are based on the idea of utilizing a variety of plant models which are specifically aligned with respect to the plant's functioning modes, i.e. nominal mode and failure modes. Each model reflects the behavior of the plant only in that specific mode. The plant models used can usually be categorized with respect to whether they are time invariant, also referred to as fixed structure models (FSM), or time dependent, also referred to as variable structure models (VSM) [??], [??], [??], [??], [??]. Whenever a system changes its state, the dynamics it undergoes change as well. The methods based on multiple model adaptive estimators (MMAE)3 do not account for such behavior. That is, the MMAE technique is based on the assumption that system behaviors change immediately and without influence on each other [??], [??], [??]. On the other hand, interactive multiple model estimators (IMME) account for this condition by explicitly modeling mode transitions as homogeneous first order Markov processes [??]. The fixed structure multiple model (FSMM) approach performs reasonably well for systems with only a few number of components and thus only a few number of system states 4. But its performance degrades considerably as the number of modes modeled increases. To address this problem a set of algorithms known as variable structure multiple model (VSMM) estimators were proposed by [??]. The advantage of VSMM based algorithms to FSMM is that VSMM incorporates the best set of models a posteriori based on the sensor measurements. That is, VSMM determines at each time step which model set to use for estimation, using a priori and a posteriori knowledge [??]. This allows it to handle several system mode models. Apart from the knowledge on how many system modes should be modeled in order to achieve the required level of accuracy, it is important to identify the representation formalism to implement them. The choice of representation dictates to what extent the diagnosis model will be accurate and complete. One can distinguish three major representations for diagnosis. Often, all of them are utilized for the implementation of the residual generation step or the residual evaluation step [??]. The following subsections provide a description of each of these representations.
[edit] Neural network based diagnosis techniques
Often a diagnosis system utilizes two types of knowledge, analytical knowledge in form of the residual generation procedure and heuristic knowledge, which is reflected in the residual evaluation step. neural networks (NN) are usually adopted for the residual evaluation step, because they are well applicable in cases where the system model is known to a certain degree and cannot be fully modeled by analytical methods. In these situations, NN are used for pattern recognition engines which identify faults based on the output measurements. There is also work being done on the application of this type of diagnosis models in robotics. For instance in [??], [??], [??] authors use NN in combination with MMAE techniques. The output of a bank
of Kalman filters in form of mahalanobis distance is fed to a pattern recognition engine based on NN. The pattern recognition engine generates a level of confidence for a particular filter. Kalman filters are used for the residual generation
procedure and are defined per system mode. They are used to estimate the output of a system and generate residuals by
comparison with actual measurement data. Authors estimate the robot's velocity and orientation:
z=[VL, VR, Θ]T, z=x=[VL, VR, Θ]T and r=z−z. (1)
Figure [???] the procedure of diagnosis.
Figure 12: The process of diagnosis based on an MMAE algorithm using a bank of Kalman filters for the detection step and NN for the identification step [??]
The mahalanobis distance fed to the NN has the form of
Dis=rTS−1r (2)
where S is the covariance matrix of the residual vector r. A similar approach to FDI is undertaken by [??]. The authors present their approach as a model free diagnosis in the sense that the approach does not use any predictive model of system state variables. The FDI process is integrated with a hybrid architecture, which the authors call a “Thinking Cap” at the core of which lies a fuzzy controller. The controller generates activation levels depending on the context and the behaviors in use. These activation levels are then fed to a monitoring module which performs the diagnosis. Figure [??] shows the internal anatomy of this module. As can be inferred from the figure, the procedure of diagnosis5 is again capitalized in form of two steps, i.e. residual generation and residual evaluation. The first step is supposed to generate features from the input activation signals which can be fed to radial basis NN (RBNN), which act as detection and isolation/identification mechanisms. The authors use two different methods for feature extraction, which they approve for suitability via conducting a series of experiments. The extraction method for the fault detection step is based on the median value of a set of activation signals, whereas for the fault isolation step the energy of a signal is adopted. The energy of the activation signal of the ith behavior is calculated according to
| x | i = |
|
(xi(n))2 (3) |
The output of the extraction step is a feature vector which is then fed to RBNN. Then, the RBNN tries to minimize the error function of the form
| E= |
|
|
(4) |
and consequently the influence of the faults on the system's performance. The problems with NN based approaches are threefold. First, in most cases NN are applicable only to systems with a relatively small number of modes. The reason for this is that when the number of modes increases, more models will be required, as there is, as already mentioned, one filter per mode which is tuned specifically for that mode. This situation may harm the performance of FDI. Second, when the number of models grows, higher dimensional NN will be required which are usually difficult to train. The third problem is that, since most of the systems are dynamic in their nature, it may be difficult to extract features which are not ambiguous, i.e. greatly overlap. This condition imposes difficulties for training NN so that it could, to some degree, identify different faults [??].
Figure 13: The structure of a Monitoring Module, where each submodule relies on NN representation [??]
[edit] Fuzzy logic and rule based diagnosis techniques
Usually the level of precision of the analytical model defines the accuracy of the diagnosis. As most of the processes under surveillance are complex and nonlinear, it is often difficult to obtain precise models. In such situations, fuzzy logic based paradigms may be useful. Fuzzy description allows to handle uncertain knowledge and to define representation on the basis of which the residual evaluation step can be undertaken. This representation can be utilized for rule-based decision making which mimics human decision making processes. Such systems have been broadly researched on in control engineering [??], but we have not encountered diagnosis applications based on fuzzy logic in robotics. On the other hand, there are some FDI approaches which use standard logic formalisms in order to perform decision making similar to the processes in rule-based methods. For instance Steinbauer et al. present a diagnosis approach which is expressed in terms of propositional logic constructs [??], [??], [??]. The method adopted is based on a consistency-based diagnosis, i.e. a procedure, where the system's real behavior models are checked against the nominal behavior models for consistency. If the models are different, the inconsistency is marked as a fault [??], [??], [??]. System modeling is performed on the software level. It is an abstract structural representation of the software architecture in terms of components, connections between them and connection dependencies. Here authors define two types of connection dependencies as
- Strong dependency - if the termination of one component requires the termination of the other component dependent on it.
- Weak dependency - if the termination of the dependent component is not required.
In order to define the runtime behavior of a system, which is an important prerequisite for the FDI, the authors introduce a convention/rule. It states that if all inputs to the model are correct then a software component should produce correct outputs. For instance, in order to describe the behavior of an object tracker (OT), the following proposition can be made: ¬AB(OT) ∧ ok(Firewire) —→ ok (ObjectMeasurement), where AB stands for abnormal and the ok construct is an observer associated with a firewire process [??], [??], [??]. Such an approach is simple to implement, but it does not take into account issues related to uncertainty, noise, state evolution etc.
[edit] Probabilistic diagnosis techniques
This approach to the diagnosis problem, particularly to residual generation, is one of the most widely used. Probabilistic methods are based on the concept of stochastic processes, where processes' or signals' stochastic attributes and behavior are considered. In robotics, probabilistic models became de facto standard during the past decade. They have been utilized for different problems of system state estimation such as localization, decision making and diagnosis. In the domain of diagnosis, the probabilistic methods which are adopted fit to the taxonomy presented in figure [??] and consequently in figure [??]. They are mostly used for the residual generation step of FDI and can usually be categorized as OO (output observer) estimation methods. Most of these estimation techniques rely on some form of Bayes rule. To be precise, they are approximations. Whenever a stochastic system is used, the mathematics involved is often concerned with calculations of a posteriori distributions of processes or signals. But the problem is that there is usually no closed form solution for this problem [??], [??]. Let
|
P(xt, dt|z1 … t)=ηtp(zt|xt, dt)∫ |
|
(5) |
be given, where xt is a state of a continuous variable, dt is a state of
a discrete variable, zt sensor measurements and ηt is normalizing constant
at time t.
There is no closed form solution to the integral in this equation. That is where approximation models come into use, of
which exists a big variety. One can structure the different modifications of these models as shown in figure [??]
6.
All of the estimation models presented here have a recursive nature, i.e. rely on data previously accumulated during the system's state evolution. We will not discuss the details of each method, but briefly describe them for the sake of completeness. For further reading we confer to [??], [??], [??], [??].
[edit] Parametric models
Parametric models, which were initially developed in the control engineering community [??], are now a widely used category of models both in control engineering and robotics. At the core of them lies the assumption that there is a functional form of a posterior distribution which is of Gaussian type. These models can be used to estimate posteriors of both linear processes, which rely on a classical Kalman filter [??], [??], and nonlinear processes, where extended or unscented Kalman filters (EKF and UKF respectively) are applied [??] [??], [??]. In diagnosis, parametric models are usually used to model a particular behavior of a system's mode and are formed into a bank of filters [?/], [??], [??]. The advantage of approaches based on these models is that they are quite simple to implement. Apart from these advantages, there are some shortcomings to these models too, because they are constrained mostly to Gaussian type of processes. Additionally, modeling system modes with a bank of filters may reduce the performance of the system when the number of modeled system modes increases.
[edit] Nonparametric models
Contrary to parametric model based filters, nonparametric model based filters do not rely on fixed functional forms of posterior distribution. Instead, they approximate them with a finite number of values or samples7 [??], [??], [??]. This aspect of nonparametric filters allows to adopt them to a wide range of process types. It has become a trend to use such models for the fault diagnosis of robotic systems in the past few years. For instance, [??], [??], [??] and [??] present a survey on the applicability of Monte Carlo approaches to diagnosis. They regard the diagnosis problem, particularly the fault detection step, as determining the distribution over some discrete state set at each time step. This distribution is equal to the marginal distribution of posterior state probabilities given in form of an equation (??). Similar to varieties of Kalman filters, particle filters also come in several modfications. The reason for this is that classical particle filters have some shortcomings when applied to the problem of online fault diagnosis. Authors define some challenges for online diagnosis problems which are difficult to address only by classical particle filter algorithms [??]. These are
- Very low prior fault probabilities
- Restricted computational resources
- High dimensional state space
- Non-linear stochastic transitions and observations
- Multi modal system behavior
We briefly discuss some modifications of particle filters (Figure ??) which address these challenges by increasing the
efficiency of classical particle filters.
The risk sensitive particle filter (RSPF) approach incorporates the concepts of risk function. Here, the low probability
states (which are most probably failure states) are given high costs whereas non-failure states with high probabilities
are given low costs. So, instead of tracking posterior probabilities, the products of risk function and posteriors are
tracked [??]. This is depicted in the following form,
xt|zt) (6)
where γt is the normalization constant and r(dt) is the risk function. During the operational phase of a system situations can arise, where several states have close probability values and, whenever a fault happens in one of these states, it is difficult to identify which state it is. A version of the particle filter algorithm called variable resolution particle filter can handle such situations. It is performed by encapsulating similar physical states into one abstract state. Figure [??] depicts such representation.
Figure 15: A Markov model representation of a system. Here (a), (b) and (c) show different levels of state abstractions [??]
Generally, if a fault happens in one of these states, the abstract state is tracked first, then the resolution of the state space is changed from the high level abstract to the ones it encapsulates [??], [??].
But in some cases it still might be difficult to diagnose faults with reliance on present data. That is, additional
information may be required, such as past or future state information. For this purpose authors rely on a combination
of UKF and particle filters. UKF is utilized to approximate procedures one step ahead. This is done as follows: Instead
of using a proposal distribution of the form pN(xt,
dt|z … t), which is needed to instantiate particles, a modified version is
employed, which is based on first order Markov process assumptions, which hold that present states are dependent on the
states one time step behind. It is given in the form p(xt,
dt|x[i]t−1,
d[i]t−1, zt). The distribution presented by
this expression is called optimal proposal distribution, it minimizes the variance of the importance weights [??],
[??]. As indicated before, it is problematic to instantiate samples from this distribution, that is why an approximation
is needed, which is performed by UKF. After the proposal distribution is approximated, the samples are drawn [??],
[??]. This approach can be categorized as a hybrid model approach in the sense that it makes use of both parametric and
nonparametric models.
A similar approach to diagnosis is carried out in [??], [??], where authors investigate Rao-Blackwellised and unscented
particle filters for rover online diagnosis.
Another interesting approach, which is also based on the concept of particle filters, is presented in [??]. Usually, an
algorithm is based on a particular model, which is often constrained by a specific set of assumptions. For instance, an
odometry motion model for localization does assume that a robot will not encounter obstacles. The authors propose to make
such model assumptions explicit and build a model abstraction hierarchy. They introduce a mixed-abstraction particle
filter algorithm which uses such a hierarchy to direct computational resources to the most efficient model whose
assumptions are met. They regard their approach as being similar to the one presented in [??]. For them the difference lies
in the fact that they switch between overlapping system models for certain parts of state space. That is why the model
hierarchy is based on efficiency differences and explicit model assumptions. The model abstraction hierarchies are
represented as acyclic graphs with various system models as nodes and model assumptions as node transitions (Figure ??).
Figure 16: An abstract model hierarchy in the mixed abstract particle filter algorithm [??].
The model assumptions/transitions are defined as binary functions on the state space of an unrestricted/general model. The authors introduce a means to check the validity of an assumption, which they describe as
| ϑt(A)= |
|
(7) |
where A is a binary assumption function and ϑ is a validity value of an assumption. A mixed abstraction particle filter estimates the system state by running several filters in parallel, each using a different system model. The decision, which model to emphasize, is taken on the basis of the estimated validity of model assumptions. If the validity of an assumption is below a predefined threshold, the number of particles in the more complex model is increased otherwise the simpler model is preferred.
[edit] Other approaches to diagnosis
Apart from the techniques described so far, there are also some approaches to diagnosis which either use some variation of the techniques depicted or introduce ideas of their own. For instance in [??], the author develops a whole subsystem of a robotics architecture which not only performs diagnosis of faults but can also provide some recovery action. The approach mainly considers hardware faults of a robotic system. In order to handle them, a set of fault tolerance processes is written for each hardware component. The author emphasizes the different levels of a system and the failures on each respective level. Three levels are defined, being hardware level, low level control (where sensor information is processed) and high level control (the robot's behavior). The approach emphasizes that errors should be confined to the level in which they originate, otherwise the higher levels should compensate for them. The author introduces different fault tolerance techniques on each level of abstraction. For instance, on the lowest level fault tolerance can be achieved by physical redundancy. On the highest level, the technique relies on redundant control strategies. A performance model exists for each strategy, and a failure is detected, if the actual behaviors do not match with the ones modeled. In this case, another strategy is employed. It is underlined that the redundant strategy approach handles the symptoms of a failure and not the cause. The approach is suitable only when faults propagate up the abstraction levels. On the low level control, the concept of virtual sensors is introduced. Virtual sensors are characterized as processes that are responsible for a robot's interaction with its environment based on sensor information. Adaptive virtual sensors are used and are responsible for handling sensor failures. They do so by ignoring the information from the broken sensors. There are four stages of fault tolerance defined: error detection, masking, recovery and reintegration. They are all implemented as concurrent processes.
- Error detection - is carried out by comparing the behavior of a leg with the “plausible leg” behavior, which is actually the model according to which the leg should perform. Another approach introduces a level of confidence for sensors (the confidence level is defined by a pain level associated with a sensor). The sensor values are compared with the ones provided by sensor models8. Then a sort of automaton is defined for a state and its correspondent sensor reading. That means, if the actual sensor reading does not comply with the automaton model, a failure is considered to have happened. The sensor monitors perform this action of comparison and failure detection.
- Masking - is performed by virtual sensors. It removes the affect of local failures. It is dependent on the pain level of the respective sensor and disregards the information from it in case its pain level indicates that it failed.
-
The purpose of a recovery process is to return the system to its operational state once a component failed.
It is defined in three forms as retry - for transient faults, dynamic recalibration - for sensor
drift and reconfiguration - for permanent failures.
- Retry - the sensor is retried several times before it is considered to be faulty. The pain level allows this, because it is set in such a way that it reaches an unacceptable level only after consequent errors.
- Dynamic recalibration - processes are associated with each sensor. These processes update the reference sensor models. This assures that the real sensor behavior is evaluated with respect to the corrected up to date models.
- Reconfiguration is dependent on virtual sensors and the number of real sensors they are associated with. If a sensor fails, the information coming from its virtual sensor is disregarded. To achieve this, virtual sensors should be reconfigured to form a new set of information.
- Reintegration - the purpose of this step of fault tolerance is to reincorporate repaired components, sensors etc. The sensor is defined to be repaired depending on the pain level associated with it (basically it is not real repair, just in some cases the sensor may fail for a time and then function again). As soon as monitor and consensus processes 9 inhibit the pain level, the injury process10 informs the virtual sensor to reincorporate the repaired sensor.
All the explications so far are associated with sensor failures which are referred to by authors as local failures. When a failure is associated with actuators, though, it is considered to be global. The failure of actuators is handled in the same way as the failure of sensors. It is achieved by faking sensor failures when actuators fail and by undergoing the standard procedure of fault tolerance. Contrary to the recovery action in the local failure mode, global failure is based on lesion mechanisms. It is of similar semantics as the pain level for sensors. Thus, the lesion behavior is activated or deactivated according to the level of lesions.
The authors claim that the fault tolerance approach presented by them can handle individual failures, concurrent failures, and accumulative failures. In case of sensor failures, these are decalibration, erroneous readings and permanent failures. The problem with this approach is related to the hardware models adopted. Since the models are obtained on the basis of experimental data, one cannot count on them in all conditions. That is, it can be that the sensors behave completely different to the model even if they are not faulty. Another constraint of this approach is that it is tied to a subsumption architecture, and it is not clear whether it is extendable to other architectures or not.
Another approach, which is one of the successful diagnosis applications, is NASA's Remote Agent Experiment (RAX) [??],
[??]. The idea behind the RAX application is a reactive model-based programming (RMBP) paradigm [??], [??]. There are two
main constituents in the model based reactive control system: A model-based reactive controller, also referred to
as model-based executive, and a reactive model based program, which is executed by the executive. The
approach relies on the concept of executive specification languages such as Esterel and Statecharts. The authors developed
a reactive model-based programming language (RMPL) which not only enables the specification of the execution but also is
“state aware” (in the sense that it allows a developer to operate on hidden states, i.e. non observable or non
controllable states). The output of the RMPL is an automaton with additional states, which otherwise are hidden, the RMPL
itself being composed of two parts:
- The control program - a collection of standard programming constructs to specify the desired behavior of a system.
- The plant model - a representation of the plant's nominal behavior, its common states and transitions.
This modeling formalism is referred to as probabilistic concurrent constraint automata. This approach to modeling allows to derive an abstract model of a system in form of interacting components, where the evolutions of the components' nominal state are represented as a probabilistic constraint automaton (PCA). PCA allows to model states, transitions between them and constraints on variables in each state. A model based program is executed by a model-based executive. It generates a sequence of actions which take a system to the specified states and configurations goals. The executive is composed of two parts, the control sequencer and the deductive controller. The control sequencer is responsible for generating a sequence of configuration goals using control program and plant state estimates. The deductive controller is meant to estimate the plant's current mode and to issue commands which take the system to the configuration goals.
The deductive controller is composed of mode estimation and mode reconfiguration. Mode estimation tracks the state trajectories of a system and their consistency with the plant model. The estimation is based on a Hidden Markov Model (HMM) enhanced with a conflict directed A* algorithm. The A* helps to reduce the state space of a system program by enumerating the state trajectories in the order of likelihood. Mode reconfiguration takes the output of ME, the current system state, and the configuration goals in order to compute the sequence of commands which take the system to a configuration goal through the most cost effective path. It achieves this with the help of, on the one hand, a goal interpreter, which it determines intermediate states reachable from the current state and can attain the goal state while performing this in a cost effective way based on conflict directed A*. On the other hand, a reactive planner is used in this operation, which generates and executes a sequence of commands which take a system to its goal state. The commands are executed one at a time and results are confirmed with ME. There are different varieties of model based executives such as Livingstone, Titan, Moriarty and Kirk [??], [??], [??], [??]. Another approach which introduces a concept of process monitors in combination with probabilistic models is presented in [??]. Here the authors use a set of monitors to gather information on the state of a system. They define the role of monitors as to detect the symptoms, where a symptom is defined as a significant difference between an observed state of the world and an expectation with respect to the nominal model/situation. A monitor informs the decision maker whenever the monitored condition is met. The monitors are implemented as independent processes using TCA constructs [??]. A supervisor, which performs diagnosis and recovery tasks, is based on Partially Observable Markov Decision Processes (POMD). POMDP is completely defined as a sixtuple of the form
Ξ ≡ (S,A,Z,T,O,R) (8)
where
- S is a finite set of robot states
- A is a finite set of robot actions
- Z is a finite set of robot observations
- T:SxA—→Π(S) is the state transition function
- O:SxA—→Π(Z) is the observation function
- R:SxAxS—→R is the reward function
The authors classify possible non-nominal situations and tie them to actions to be performed. The uncertainty of the decision making process is accounted for by POMDP. Figure [??] depicts the process of decision making.
The problem of this approach is that one should account for all the possible failure modes which may explode the state
space of a problem [??].
The aim of this chapter was to present different approaches to fault tolerance with an emphasis on fault diagnosis. We
attempted to create a taxonomy for these methods (Figure ??) and relate it to taxonomies discussed in the subsection on
fault tolerance of the previous chapter (Figure ??). We reviewed different multiple model algorithms, i.e. FMMA and VSMM
and their adaptive and interactive versions. Additionally, we presented three main categories of representation formalisms
which can be used to implement these algorithms. In addition, the advantages and disadvantages of each approach were investigated. On the one hand, it would be naive to judge the state of the art only from the articles we reviewed here. But on the other hand, we believe that having learnt from past experiences, robotics is currently increasingly adopting approaches based on probabilistic models. These models seem to be suitable considering the fact that a robot is a highly dynamic system which often interacts with even more dynamic and uncertain environments.
The next section will be concerned with Robotic Software Integration Systems (SIS) and approaches used to make them
dependable. We will also conduct an analysis on how various techniques (including the ones discussed here) to achieve
robustness and fault tolerance are integrated with such systems.
