Determination of the failure rate. Networked electronic scientific journal "System Engineering"

Abstract: Two types of means of maintaining high availability are considered: ensuring fault tolerance (failover capability, survivability) and ensuring safe and quick recovery after failures (serviceability).

Availability

Basic concepts

An information system provides its users with a certain set of services. The desired level of availability of these services is said to be provided if the following indicators are within specified limits:

  • Service efficiency. The efficiency of a service is defined in terms of the maximum request service time, the number of supported users, and so on. It is required that the efficiency not fall below a predetermined threshold.
  • Unavailable time. If the efficiency of an information service does not satisfy the imposed restrictions, the service is considered unavailable. It is required that the maximum duration of a period of unavailability and the total unavailable time over a certain period (a month, a year) not exceed predetermined limits.

In essence, it is required that the information system almost always work with the desired efficiency. For some critical systems (e.g., control systems), the unavailable time must be zero, without any "almost". In this case, one speaks of the probability of an unavailability situation occurring and requires that this probability not exceed a given value. This problem is solved by special fault-tolerant systems, which are usually very expensive.

The vast majority of commercial systems are subject to less stringent requirements; however, modern business life imposes rather severe restrictions here, when the number of users served can be measured in thousands, the response time must not exceed a few seconds, and the unavailable time must not exceed a few hours per year.

The task of providing high availability must be solved for modern configurations built with client/server technology. This means that the entire chain needs to be protected: from users (possibly remote) to critical servers (including security servers).

The main threats to availability were considered earlier.

In accordance with GOST 27.002, a failure is understood as an event consisting in the loss of a product's ability to function. In the context of this work, a product is an information system or its component.

In the simplest case, it can be considered that a failure of any component of a composite product leads to total failure, and that the distribution of failures over time is a simple Poisson flow of events. In this case, the concepts of failure rate λi and mean time between failures Ti are introduced, related by

Ti = 1 / λi,

where i is the component number, λi is the failure rate, and Ti is the mean time between failures.

The failure rates of independent components add up:

λ = Σ λi

The mean time between failures for the composite product is then given by

T = 1 / λ = 1 / Σ λi

These simple calculations already show that if there is a component whose failure rate is much greater than that of the rest, then it is this component that determines the mean time between failures of the entire information system. This is the theoretical justification for the principle of strengthening the weakest link first.
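Under the series model above (independent components, Poisson failures), the weakest-link effect is easy to check numerically; the component failure rates below are hypothetical illustration values:

```python
# Series (single-point-of-failure) model: failure rates of independent
# components add up, and the composite MTBF is the reciprocal of the sum.
# All component failure rates below are hypothetical, in 1/hour.
lambdas = {
    "disk": 1e-6,
    "power_supply": 2e-6,
    "network_card": 5e-7,
    "flaky_software": 1e-4,  # the weakest link: dominates the sum
}

lambda_total = sum(lambdas.values())  # lambda = sum(lambda_i)
mtbf_hours = 1.0 / lambda_total       # T = 1 / lambda

print(f"total failure rate: {lambda_total:.3e} 1/h")
print(f"composite MTBF:     {mtbf_hours:.0f} h")
# The weakest link alone would give 1/1e-4 = 10000 h; the composite MTBF
# (about 9662 h) is close to that, so the worst component sets the system MTBF.
```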

The Poisson model also substantiates another very important proposition: the empirical approach to building high-availability systems cannot be implemented in a reasonable amount of time. In a traditional test/debug cycle of a software system, according to optimistic estimates, each error correction leads to an exponential decrease (by about half a decimal order of magnitude) in the failure rate. It follows that in order to verify experimentally that the required level of availability has been reached, regardless of the testing and debugging technology used, one would have to spend time almost equal to the mean time between failures. For example, to demonstrate a mean time between failures of 10^5 hours would take more than 10^4.5 hours, which is more than three years. This means that other methods of building high-availability systems are needed, methods whose effectiveness has been proven analytically or practically over more than fifty years of development of computer science and programming.

The Poisson model is applicable in cases where the information system contains single points of failure, that is, components whose failure leads to the failure of the entire system. A different formalism is used to study redundant systems.

In accordance with the problem statement, we will assume that there is a quantitative measure of the effectiveness of the information services provided by the product. In this case, one can speak of the performance indicators of individual elements and of the efficiency of functioning of the entire complex system.

As a measure of availability, one can take the probability that the effectiveness of the services provided by the information system remains acceptable throughout the considered period of time. The greater the margin of efficiency the system has, the higher its availability.

If there is redundancy in the system configuration, the probability that during the considered period the efficiency of information services does not fall below the allowable limit depends not only on the probability of component failures, but also on the time during which failed components remain inoperative: while they are down, the total efficiency drops, and each subsequent failure can become fatal. To maximize system availability, the downtime of each component must be minimized. In addition, it should be borne in mind that, generally speaking, repair work may require a decrease in efficiency or even a temporary shutdown of healthy components; this kind of influence also needs to be minimized.

A few terminological remarks. In the literature on reliability theory, instead of availability one usually speaks of readiness (including high readiness). We have preferred the term "availability" to emphasize that an information service should not just be "ready" on its own, but be available to its users under conditions where unavailability can be caused by reasons that, at first glance, are not directly related to the service (for example, the lack of consulting services).

Further, instead of unavailable time one usually speaks of the availability factor. We wanted to draw attention to two indicators, the duration of a single downtime and the total duration of downtime, so we preferred the term "unavailable time" as more capacious.

Fundamentals of High Availability Measures

The basis for measures to improve availability is a structured approach, embodied in object-oriented methodology. Structuring is necessary with respect to all aspects and constituent parts of an information system, from architecture to administrative databases, and at all stages of its life cycle, from initiation to decommissioning. Structuring, important in itself, is at the same time a necessary condition for the feasibility of the other availability measures. Only small systems can be built and operated arbitrarily. Large systems have their own laws, which, as we have already noted, programmers first realized more than 30 years ago.

When developing measures to ensure high availability

Part 1.

Introduction
The development of modern equipment is characterized by a significant increase in its complexity. This growing complexity raises the requirements for guaranteeing the timeliness and correctness of problem solving.
The problem of reliability arose in the 1950s, when systems began to grow rapidly more complex and new objects began to be put into operation. At that time, the first publications appeared defining the concepts and definitions related to reliability [1], and a technique was created for assessing and calculating the reliability of devices by probabilistic-statistical methods.
The study of the behavior of equipment (an object) during operation and the assessment of its quality determine its reliability. The term "exploitation" comes from the French word "exploitation", which means deriving benefit or profit from something.
Reliability is the property of an object to perform specified functions, keeping the values of established performance indicators within specified limits over time.
To quantify the reliability of an object and to plan its operation, special characteristics are used: reliability indicators. They make it possible to assess the reliability of an object or its elements under various conditions and at different stages of operation.
More details about reliability indicators can be found in GOST 16503-70 - "Industrial products. Nomenclature and characteristics of the main reliability indicators.", GOST 18322-73 - "Maintenance and repair systems for equipment. Terms and definitions.", GOST 13377-75 - "Reliability in engineering. Terms and definitions".

Definitions
Reliability is the property of an object (hereinafter "OB") to perform required functions, maintaining its performance for a given period of time.
Reliability is a complex property that combines the concepts of operability, dependability, durability, maintainability, and storability.
Operability is the state of the OB in which it is able to perform its functions.
Dependability is the ability of the OB to maintain its operability for a certain time. An event that disrupts the operation of the OB is called a failure. A self-recovering failure is called a fault (glitch).
Durability is the ability of the OB to maintain its operability up to a limit state, at which its operation becomes impossible for technical or economic reasons, for safety reasons, or because a major overhaul is needed.
Maintainability determines the adaptability of the OB to the prevention and detection of malfunctions and failures, and to their elimination through repair and maintenance.
Storability is the property of the OB to continuously maintain its operability during and after storage and transportation.

Key Reliability Indicators
The main quantitative indicators of reliability are the probability of failure-free operation, the failure rate, and the mean time to failure.
The probability of failure-free operation P(t) is the probability that within a specified period of time t no failure of the OB will occur. This indicator is determined as the ratio of the number of OB elements that have operated without failure up to time t to the total number of OB elements that were operational at the initial moment.
The failure rate λ(t) is the number of failures n(t) of OB elements per unit of time, referred to the average number N of OB elements operational during the interval Δt:

λ(t) = n(t) / (N·Δt), where

Δt is the given period of time.
For example: 1000 OB elements operated for 500 hours. During this time, 2 elements failed. Hence λ(t) = n(t)/(N·Δt) = 2/(1000·500) = 4·10⁻⁶ 1/h, i.e., about 4 elements out of a million may fail in one hour.
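The statistical definition above can be sketched in a few lines of code; the helper name is ours, and the numbers reproduce the worked example:

```python
# lambda(t) = n(t) / (N * dt): statistical failure-rate estimate.
def failure_rate(n_failed: int, n_elements: int, hours: float) -> float:
    """Failures per element-hour over the observation interval."""
    return n_failed / (n_elements * hours)

# The worked example: 1000 elements observed for 500 h, 2 failures.
lam = failure_rate(n_failed=2, n_elements=1000, hours=500.0)
print(f"lambda = {lam:.1e} 1/h")  # lambda = 4.0e-06 1/h
```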
Component failure rates are taken from reference data [1, 6, 8]. For example, reference tables give the failure rate λ(t) (in units of 10⁻⁵ 1/h) for elements such as resistors, capacitors, transformers, inductors, switching devices, solder connections, wires and cables, and electric motors.


The reliability of the OB as a system is characterized by a failure flow Λ, numerically equal to the sum of the failure rates of the individual devices:

Λ = Σ λi

The formula calculates the failure flow of the OB and of individual OB devices, which in turn consist of various nodes and elements characterized by their own failure rates. The formula is valid for calculating the failure flow of a system of n elements in the case when the failure of any one of them leads to the failure of the entire system as a whole. Such a connection of elements is called logically sequential, or basic. There is also a logically parallel connection of elements, in which the failure of one of them does not lead to the failure of the system as a whole. The relationship between the probability of failure-free operation P(t) and the failure flow Λ is given by:

P(t) = exp(−Λt); obviously 0 < P(t) < 1, with P(0) = 1 and P(∞) = 0.

The mean time between failures To is the mathematical expectation of the operating time of the OB before the first failure:

To = 1/Λ = 1/(Σ λi), or, conversely, Λ = 1/To.

The mean time between failures is thus the reciprocal of the failure flow.
For example: the element technology provides an average failure rate λi = 1·10⁻⁵ 1/h. If the OB uses N = 1·10⁴ elementary parts, the total failure rate is λo = N·λi = 10⁻¹ 1/h. Then the mean time of failure-free operation of the OB is To = 1/λo = 10 h. If instead the OB is built from 4 large-scale integrated circuits (LSI), the mean time of failure-free operation of the OB increases by N/4 = 2500 times and becomes 25,000 hours, or about 34 months, or about 3 years.
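The LSI example can be reproduced as follows (a sketch; the function name is ours):

```python
# Failure flow and mean time between failures for a series system.
def mtbf(total_failure_rate: float) -> float:
    """To = 1 / Lambda, in hours."""
    return 1.0 / total_failure_rate

lam_elem = 1e-5              # failure rate of one elementary part, 1/h
n_parts = 10_000
lam_discrete = n_parts * lam_elem   # Lambda = N * lambda_i = 0.1 1/h
lam_lsi = 4 * lam_elem              # the same OB built from 4 LSI chips

print(mtbf(lam_discrete))    # 10 h for the discrete design
print(mtbf(lam_lsi))         # 25000 h for the LSI design, a 2500x gain
```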

Reliability calculation
The formulas make it possible to calculate the reliability of the OB if the initial data are known: the composition of the OB, its operating mode and conditions, and the failure rates of its components (elements). In practice, however, reliability calculations run into the absence of trustworthy failure-rate data for the whole range of elements, assemblies, and devices of the OB. The way out is the coefficient method. Its essence is that the reliability calculation of the OB uses not the absolute values of the failure rates λi but reliability coefficients ki relating the values λi to the failure rate λb of some basic element:
ki = λi / λb
The reliability coefficient ki is practically independent of the operating conditions and is a constant for a given element, while differences in operating conditions are taken into account by corresponding changes in λb via a coefficient ku. In theory and practice, a resistor is chosen as the basic element. Component reliability indicators are taken from reference data [1, 6, 8]; for example, reliability coefficients ki are given for some elements. Table 3 gives the operating-condition coefficients ku for some types of equipment.
The influence on element reliability of the main destabilizing factors (electrical loads, ambient temperature) is taken into account by introducing correction factors a into the calculation. Table 4 gives the coefficients a for some types of elements. The influence of other factors (dust, humidity, etc.) is accounted for by correcting the failure rate of the basic element with corresponding correction factors.
The resulting reliability coefficient of an OB element, taking the correction factors into account, is:
ki' = a1·a2·a3·a4·ki·ku, where
ku is the nominal value of the operating-condition coefficient;
ki is the nominal value of the reliability coefficient;
a1 is a coefficient accounting for the influence of the electrical load in terms of U, I, or P;
a2 is a coefficient accounting for the influence of the ambient temperature;
a3 is a coefficient of load reduction from nominal in terms of U, I, or P;
a4 is a coefficient of utilization of the given element in the operation of the OB as a whole.
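As a sketch of the formula for ki', with hypothetical coefficient values (not taken from the reference tables):

```python
# Resulting reliability coefficient ki' = a1*a2*a3*a4*ki*ku.
# All numeric values are hypothetical illustration values.
def resulting_coefficient(ki: float, ku: float,
                          a1: float, a2: float, a3: float, a4: float) -> float:
    """Corrected reliability coefficient of one element."""
    return a1 * a2 * a3 * a4 * ki * ku

ki_corr = resulting_coefficient(ki=2.0, ku=2.5, a1=0.8, a2=1.2, a3=0.9, a4=1.0)
print(f"ki' = {ki_corr:.2f}")  # ki' = 4.32
```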

Table 3 lists the operating-condition coefficient ku for: laboratory conditions; stationary equipment indoors and outdoors; and mobile equipment (shipborne, automotive, and train-mounted).

Table 4 lists the load factors a for, among others: resistors (by voltage and by power); capacitors (by voltage and by reactive power); and semiconductor elements (by direct current, reverse voltage, junction temperature, collector current, collector-emitter voltage, and power dissipation).

The calculation procedure is as follows:
1. Determine the quantitative values of the parameters that characterize normal operation of the OB.
2. Draw up an element-by-element circuit diagram of the OB that defines the connection of elements when performing the given function. Auxiliary elements not involved in performing the OB's function are not taken into account.
3. The initial data for the reliability calculation are determined:

  • type, quantity, and nominal data of the elements
  • operating mode, ambient temperature, and other parameters
  • element utilization factor
  • system utilization factor
  • the basic element is chosen and its failure rates λb and λb' are determined
  • the reliability coefficient is determined by the formula ki' = a1·a2·a3·a4·ki·ku

4. The main OB reliability indicators are determined for a logically sequential (main) connection of elements, nodes, and devices:

  • probability of failure-free operation: P(t) = exp(−λb'·t·Σ(Ni·ki')), where
    Ni is the number of identical elements in the OB,
    n is the total number of element types in the OB connected in the main (series) configuration
  • time to failure:
    To = 1/(λb'·Σ(Ni·ki'))

If the OB scheme contains sections with parallel connection of elements, the reliability indicators are first calculated separately for those sections, and then for the OB as a whole.
5. The found reliability indicators are compared with the required ones. If they do not match, measures are taken to improve the reliability of the OB.
6. Means of improving OB reliability are the introduction of redundancy, which may be:

  • intra-element: the use of more reliable elements
  • structural: redundancy, either common or separate
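The series-connection formulas of step 4 can be sketched as follows; the element counts Ni and coefficients ki' below are hypothetical illustration values:

```python
import math

# Step 4 sketch: series ("main") connection indicators by the coefficient
# method. Element counts Ni and corrected coefficients ki' are hypothetical.
lam_b = 7.5e-8                 # corrected basic-element failure rate, 1/h
elements = [(10, 50.0),        # (Ni, ki') pairs
            (4, 200.0),
            (2, 400.0)]

s = sum(n * k for n, k in elements)   # S = sum(Ni * ki')
to = 1.0 / (lam_b * s)                # To = 1 / (lam_b' * S)
t = 5000.0                            # operating time of interest, h
p = math.exp(-lam_b * t * s)          # P(t) = exp(-lam_b' * t * S)

print(f"S = {s:.0f}, To = {to:.0f} h, P({t:.0f} h) = {p:.3f}")
```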

Calculation example:
Let us calculate the main reliability indicators for a fan driven by an asynchronous electric motor M. The circuit diagram is shown in the figure. To start M, circuit breaker QF is closed and then button SB1 is pressed. Contactor KM1 is energized, operates, and with its contacts KM2 connects M to the power source, while an auxiliary contact shunts SB1. Button SB2 serves to switch M off.

Motor M is protected by circuit breaker FA and thermal relay KK1 with contacts KK2. The fan operates indoors at T = 50 °C in continuous mode. For the calculation we apply the coefficient method, using the reliability coefficients of the circuit components. We take the failure rate of the basic element to be λb = 3·10⁻⁸ 1/h. Based on the circuit diagram and its analysis, we draw up the main scheme for the reliability calculation, which includes the components whose failure leads to complete failure of the device. The initial data are summarized below.

Base element failure rate: λb = 3·10⁻⁸ 1/h
Operating-condition coefficient: ku = 2.5
Corrected failure rate: λb' = λb·ku = 7.5·10⁻⁸ 1/h

For each element of the circuit diagram, the calculation table lists: the corresponding design-model element, the number of elements Ni, the reliability coefficient ki, the load, electrical-load, temperature, and power-load coefficients, the utilization coefficient, the product of the coefficients a, and the resulting reliability coefficient ki', together with the sum Σ(Ni·ki') and the working time t in hours.

Time to failure: To = 1/[λb'·Σ(Ni·ki')] = 3523.7 h
Probability of failure-free operation: P = exp[−λb'·t·Σ(Ni·ki')] = 0.24

Based on the calculation results, the following conclusions can be drawn:
1. Time to failure of the device: To = 3524 h.
2. Probability of failure-free operation: P(t) = 0.24, i.e. the probability that within the given operating time t, under the given operating conditions, no failure will occur.
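A rough numerical check of the example is possible, although the sum Σ(Ni·ki') and the working time were lost with the table; the values below are hypothetical choices consistent with the published results To = 3523.7 h and P = 0.24:

```python
import math

# Rough check of the fan example. The sum S = sum(Ni*ki') and the working
# time t are not preserved in the text; the values below are hypothetical
# choices consistent with the published results To = 3523.7 h and P = 0.24.
lam_b = 3e-8                 # basic-element failure rate, 1/h
ku = 2.5                     # operating-condition coefficient -> lam_b' = 7.5e-8
lam_b_corr = lam_b * ku
s = 3784.0                   # assumed sum(Ni * ki')
t = 5000.0                   # assumed working time, h

to = 1.0 / (lam_b_corr * s)
p = math.exp(-lam_b_corr * t * s)
print(f"To = {to:.1f} h, P = {p:.2f}")  # To = 3523.6 h, P = 0.24
```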

Particular cases of reliability calculation.

1. An object (hereinafter OB) consists of n blocks connected in series. The probability of failure-free operation of each block is p. Find the probability of failure-free operation P of the system as a whole.

Solution: P = p^n
2. The OB consists of n blocks connected in parallel. The probability of failure-free operation of each block is p. Find the probability of failure-free operation P of the system as a whole.

Solution: P = 1 − (1 − p)^n
3. The OB consists of blocks connected in parallel through a switch. The probability of failure-free operation of each block is p, and that of the switch is p1. Find the probability of failure-free operation P of the system as a whole.

Solution: P = 1 − (1 − p)·(1 − p1·p)
4. The OB consists of n blocks, with probability of failure-free operation p for each block. To increase the reliability of the OB, duplication with identical blocks was performed. Find the probability of failure-free operation of the system: with duplication of each block (Pa) and with duplication of the entire system (Pb).

Solution: Pa = [1 − (1 − p)²]^n, Pb = 1 − (1 − p^n)²
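Case 4 can be checked numerically; the standard formulas Pa = [1 − (1 − p)²]^n and Pb = 1 − (1 − p^n)² are assumed, and the values of p and n are arbitrary illustration values:

```python
# Case 4: block-level duplication vs whole-system duplication.
def dup_each_block(p: float, n: int) -> float:
    """Pa = (1 - (1 - p)**2) ** n."""
    return (1 - (1 - p) ** 2) ** n

def dup_whole_system(p: float, n: int) -> float:
    """Pb = 1 - (1 - p**n) ** 2."""
    return 1 - (1 - p ** n) ** 2

p, n = 0.9, 3
pa, pb = dup_each_block(p, n), dup_whole_system(p, n)
print(f"Pa = {pa:.4f}, Pb = {pb:.4f}")  # block-level duplication wins: Pa > Pb
```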
5. The OB consists of nodes U1 and U2 and an element C (see Fig. 10). When C is serviceable, the probabilities of failure-free operation are p1 for U1 and p2 for U2. When C is faulty, they are p1' and p2' respectively. The probability of failure-free operation of C is ps. Find the probability of failure-free operation P of the system as a whole.

Solution: P = ps·(p1·p2) + (1 − ps)·(p1'·p2')
9. The OB consists of two nodes, U1 and U2. The probabilities of failure-free operation during time t are p1 = 0.8 for U1 and p2 = 0.9 for U2. After time t has elapsed, the OB turns out to be faulty. Find the probabilities that:
- H1: only node U1 is faulty
- H2: only node U2 is faulty
- H3: both nodes U1 and U2 are faulty
Solution: The hypothesis H0, that both nodes are healthy, is ruled out, since the OB is faulty; the observed event is A = H1 + H2 + H3.
Prior (initial) probabilities:
- P(H1) = (1 − p1)·p2 = 0.2·0.9 = 0.18
- P(H2) = (1 − p2)·p1 = 0.1·0.8 = 0.08
- P(H3) = (1 − p1)·(1 − p2) = 0.2·0.1 = 0.02
- P(A) = Σ P(Hi) = 0.18 + 0.08 + 0.02 = 0.28
Posterior (final) probabilities:
- P(H1|A) = P(H1)/P(A) = 0.18/0.28 = 0.643
- P(H2|A) = P(H2)/P(A) = 0.08/0.28 = 0.286
- P(H3|A) = P(H3)/P(A) = 0.02/0.28 = 0.071
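The posterior computation of case 9 in code (a sketch; the hypothesis labels are ours):

```python
# Case 9: posteriors for which node failed, given that the OB is faulty.
p1, p2 = 0.8, 0.9

priors = {
    "H1: only U1 faulty": (1 - p1) * p2,
    "H2: only U2 faulty": (1 - p2) * p1,
    "H3: both faulty":    (1 - p1) * (1 - p2),
}
p_a = sum(priors.values())                 # P(A) = 0.28

posteriors = {h: pr / p_a for h, pr in priors.items()}
for h, pr in posteriors.items():
    print(f"P({h} | A) = {pr:.3f}")
```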
10. The OB consists of m blocks of type U1 and n blocks of type U2. The probability of failure-free operation during time t is p1 for each U1 block and p2 for each U2 block. For the OB to work, it is sufficient that during t any 2 blocks of type U1 and, simultaneously, any 2 blocks of type U2 operate without failure. Find the probability of failure-free operation of the OB.
Solution: Event A (failure-free operation of the OB) is the product of two events:
- A1: at least 2 of the m blocks of type U1 are working
- A2: at least 2 of the n blocks of type U2 are working
The number X1 of failure-free blocks of type U1 is a random variable distributed according to the binomial law with parameters m and p1. Event A1 means that X1 takes a value of at least 2, so:

P(A1) = P(X1 ≥ 2) = 1 − P(X1 < 2) = 1 − P(X1 = 0) − P(X1 = 1) = 1 − (g1^m + m·g1^(m−1)·p1), where g1 = 1 − p1

Likewise: P(A2) = 1 − (g2^n + n·g2^(n−1)·p2), where g2 = 1 − p2

Probability of failure-free operation of the OB:

R = P(A) = P(A1)·P(A2) = [1 − (g1^m + m·g1^(m−1)·p1)]·[1 − (g2^n + n·g2^(n−1)·p2)]
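Case 10 reduces to an "at least 2 of k" binomial probability; the block counts and probabilities below are hypothetical:

```python
# Case 10: "at least 2 of k blocks survive" under the binomial model.
def at_least_two(k: int, p: float) -> float:
    """1 - P(X=0) - P(X=1) for X ~ Binomial(k, p)."""
    q = 1 - p
    return 1 - (q ** k + k * q ** (k - 1) * p)

m, p1 = 4, 0.9   # hypothetical U1 blocks and their survival probability
n, p2 = 5, 0.8   # hypothetical U2 blocks and their survival probability

r = at_least_two(m, p1) * at_least_two(n, p2)
print(f"R = {r:.4f}")  # R = 0.9896
```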

11. The OB consists of 3 nodes. Node U1 contains n1 elements with failure rate λ1. Node U2 contains n2 elements with failure rate λ2. Node U3 contains n3 elements, also with failure rate λ2, since U2 and U3 duplicate each other. U1 fails if at least 2 of its elements fail. U2 and U3, being duplicated, fail if at least one element fails in them. The OB fails if U1 fails, or if U2 and U3 fail together. The probability of failure-free operation of each element is p. Find the probability that the OB does not fail during time t.
The failure probabilities of U2 and U3 (at least one element failing) are:

R2 = 1 − p^n2, R3 = 1 − p^n3

and the failure probability of the entire OB is:

R = R1 + (1 − R1)·R2·R3

where R1 is the probability that at least 2 of the n1 elements of U1 fail.

Literature:

  • Malinsky V.D. et al. Testing of Radio Equipment. Moscow: Energia, 1965.
  • GOST 16503-70. Industrial products. Nomenclature and characteristics of the main reliability indicators.
  • Shirokov A.M. Reliability of Radio Electronic Devices. Moscow: Vysshaya Shkola, 1972.
  • GOST 18322-73. Maintenance and repair systems for equipment. Terms and definitions.
  • GOST 13377-75. Reliability in engineering. Terms and definitions.
  • Kozlov B.A., Ushakov I.A. Handbook on Reliability Calculation of Radio Electronics and Automation Equipment. Moscow: Sovetskoye Radio, 1975.
  • Perrote A.I., Storchak M.A. Questions of Reliability of Radio-Electronic Equipment. Moscow: Sovetskoye Radio, 1976.
  • Levin B.R. Theory of Reliability of Radio Engineering Systems. Moscow: Sovetskoye Radio, 1978.
  • GOST 16593-79. Electric drives. Terms and definitions.

I. Bragin 08.2003

The failure rate is the ratio of the number of equipment samples that failed per unit of time to the average number of samples operating properly in the given time interval, provided that failed samples are neither restored nor replaced by serviceable ones.

This characteristic is denoted λ(t). By definition,

λ(t) = n(t) / (Nav·Δt),     (1.20)

where n(t) is the number of samples that failed in the interval Δt; Δt is the time interval; Nav = (Ni + Ni+1)/2 is the average number of properly working samples in the interval; Ni is the number of properly working samples at the beginning of the interval, and Ni+1 at its end.

Expression (1.20) is the statistical definition of the failure rate. For a probabilistic representation of this characteristic, let us establish the relationship between the failure rate λ(t), the probability of failure-free operation P(t), and the failure frequency a(t).

Let us substitute into expression (1.20) the expression for n(t) from formulas (1.11) and (1.12). Then we obtain:

λ(t) = N0·[P(t) − P(t + Δt)] / (Nav·Δt).

Taking into account expression (1.3) and the fact that Nav = N0 − n(t) = N0·P(t), we find:

λ(t) = [P(t) − P(t + Δt)] / [P(t)·Δt].

Letting Δt tend to zero and passing to the limit, we obtain:

λ(t) = −(1/P(t))·dP(t)/dt.     (1.21)

Integrating expression (1.21), we obtain:

P(t) = exp(−∫[0,t] λ(τ)dτ).     (1.22)

Since a(t) = −dP(t)/dt, on the basis of expression (1.21) we obtain:

λ(t) = a(t)/P(t),     (1.23)

and hence

a(t) = λ(t)·exp(−∫[0,t] λ(τ)dτ).     (1.24)

Expressions (1.22)-(1.24) establish the relationship between the probability of failure-free operation, the failure frequency, and the failure rate.


Expression (1.23) can serve as the probabilistic definition of the failure rate.
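As a quick numerical sanity check of these relations, assume a constant failure rate λ, so that P(t) = exp(−λt) and the failure frequency is a(t) = λ·exp(−λt); their ratio recovers λ:

```python
import math

# Sanity check with a constant failure rate lam: P(t) = exp(-lam*t),
# failure frequency a(t) = lam*exp(-lam*t), and a(t)/P(t) recovers lam.
lam = 4e-6      # 1/h, the value from the earlier worked example
t = 500.0       # h

p = math.exp(-lam * t)          # probability of failure-free operation
a = lam * math.exp(-lam * t)    # failure frequency
print(a / p)                    # approximately 4e-06, i.e. lam
```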

The failure rate as a quantitative characteristic of reliability has a number of advantages. It is a function of time and makes it possible to identify visually the characteristic periods of equipment operation. This can significantly improve equipment reliability. Indeed, if the burn-in time (t1) and the wear-out onset time (t2) are known, then it is possible to set reasonably the training (burn-in) period of the equipment before the start of its operation, and its resource before repair. This makes it possible to reduce the number of failures during operation, i.e. ultimately to increase the reliability of the equipment.

The failure rate as a quantitative characteristic of reliability has the same drawback as the failure frequency: it characterizes quite simply the reliability of equipment only up to the first failure. It is therefore a convenient reliability characteristic for single-use systems and, in particular, for the simplest elements.

Once the failure rate is known, the remaining quantitative characteristics of reliability are most simply determined from it.

These properties of the failure rate allow it to be considered the main quantitative characteristic of the reliability of the simplest elements of radio electronics.

1.1 Probability of failure-free operation

The probability of failure-free operation is the probability that, under certain operating conditions, no failure will occur within a given operating time.
The probability of failure-free operation is denoted P(l) and is determined by formula (1.1):

P(l) = [N0 − r(l)] / N0,     (1.1)

where N0 is the number of elements at the beginning of the test and r(l) is the number of element failures accumulated by operating time l. Note that the larger the value of N0, the more accurately the probability P(l) can be calculated.
At the start of operation of a serviceable locomotive, P(0) = 1, since at mileage l = 0 the probability that no element fails takes its maximum value, 1. As mileage l increases, the probability P(l) decreases. As the service life approaches an infinitely large value, the probability of failure-free operation tends to zero: P(l→∞) = 0. Thus, over the operating time, the probability of failure-free operation varies from 1 to 0. The nature of the change in the probability of failure-free operation as a function of mileage is shown in Fig. 1.1.

Fig. 1.1. Change in the probability of failure-free operation P(l) as a function of operating time

The main advantages of this indicator are twofold: first, the probability of failure-free operation covers all factors affecting the reliability of the elements, making it possible to judge reliability directly, since the larger P(l), the higher the reliability; second, the probability of failure-free operation can be used in reliability calculations of complex systems consisting of more than one element.

1.2 Probability of failure

The probability of failure is the probability that, under certain operating conditions, at least one failure will occur within a given operating time.
The probability of failure is denoted Q(l) and is determined by formula (1.2):

Q(l) = r(l) / N0 = 1 − P(l).     (1.2)

At the start of operation of a serviceable locomotive, Q(0) = 0, since at mileage l = 0 the probability that at least one element fails takes its minimum value, 0. As mileage l increases, the probability of failure Q(l) increases. As the service life approaches an infinitely large value, the probability of failure tends to unity: Q(l→∞) = 1. Thus, over the operating time, the probability of failure varies from 0 to 1. The nature of the change in the probability of failure as a function of mileage is shown in Fig. 1.2. The probability of failure-free operation and the probability of failure are opposite and incompatible events.

Fig. 1.2. Change in the probability of failure Q(l) as a function of operating time

1.3 Failure frequency

The failure frequency is the ratio of the number of elements that failed per unit of time or mileage to the initial number of elements under test. In other words, the failure frequency is an indicator characterizing the rate of change of the probability of failure and of the probability of failure-free operation as operating time grows.
The failure frequency is denoted a(l) and is determined by formula (1.3):

a(l) = Δr(l) / (N0·Δl),     (1.3)

where Δr(l) is the number of elements that failed during the mileage interval Δl.
By its value, this indicator makes it possible to judge how many elements will fail over a given period of time or mileage, and to calculate the number of required spare parts.
The nature of the change of the failure frequency as a function of mileage is shown in Fig. 1.3.


Fig. 1.3. Change in the failure frequency as a function of operating time

1.4 Failure rate

The failure rate is the conditional density of occurrence of an object's failure, determined for the considered moment of time or operating time, provided that no failure has occurred before that moment. Equivalently, the failure rate is the ratio of the number of elements that failed per unit of time or mileage to the number of elements operating properly in the given interval.
The failure rate is denoted λ(l) and is determined by formula (1.4):

λ(l) = Δr(l) / (Nav·Δl),     (1.4)

where Nav = (Ni + Ni+1)/2 is the average number of properly working elements in the interval Δl.

After the burn-in period, the failure rate is typically a non-decreasing function of time. The failure rate is usually used to assess the propensity to failures at various moments of an object's operation.
The theoretical nature of the change of the failure rate as a function of mileage is presented in Fig. 1.4.

Fig. 1.4. Change in the failure rate as a function of operating time

On the failure-rate curve shown in Fig. 1.4, three main stages can be singled out, reflecting the operating life of an element or of the object as a whole.
The first stage, also called the burn-in stage, is characterized by an elevated failure rate in the initial period of operation. The reason for the elevated failure rate at this stage is hidden manufacturing defects.
The second stage, or the period of normal operation, is characterized by the failure rate tending to a constant value. During this period, random failures may occur, caused by sudden load concentrations exceeding the ultimate strength of the element.
The third stage is the so-called period of accelerated aging. It is characterized by the occurrence of wear-out failures. Further operation of the element without replacement becomes economically unjustified.

1.5 Mean time to failure

Mean time to failure is the average mileage (operating time) of an element before failure.
The mean time to failure is denoted L1 and is determined by formula (1.5):

L1 = Σ li / r,     (1.5)

where li is the operating time (mileage) to failure of the i-th element and r is the number of failures.
The mean time to failure can be used to preliminarily determine the timing of the repair or replacement of the element.

1.6 Mean value of the failure rate parameter

The average value of the failure flow parameter characterizes the average density of the probability of occurrence of an object failure, determined for the considered moment of time.
The average value of the failure flow parameter is denoted Wav and is determined by formula (1.6):

Wav = Σ r(l) / (N0·l).     (1.6)

1.7 Example of calculation of reliability indicators

Initial data.
During the run from 0 to 600 thousand km, information on failures of traction electric motors (TEDs) was collected in a locomotive depot. The number of serviceable TEDs at the beginning of the period was N0 = 180, and the total number of failed TEDs over the analyzed period was Σr(600000) = 60. The run interval is taken equal to 100 thousand km, and the number of failed TEDs in successive intervals was 2, 12, 16, 10, 14 and 6.

Required.
It is necessary to calculate the reliability indicators and plot their variation with operating time.

First, fill in the table of initial data as shown in Table 1.1.

Table 1.1.

Initial data for calculation
Δl, thousand km | 0-100 | 100-200 | 200-300 | 300-400 | 400-500 | 500-600
ri              | 2     | 12      | 16      | 10      | 14      | 6
Σri             | 2     | 14      | 30      | 40      | 54      | 60

First, using equation (1.1), we determine the probability of failure-free operation for each run interval. For the runs up to 100 and up to 200 thousand km:

P(100000) = (N0 − Σr) / N0 = (180 − 2) / 180 = 0.989; P(200000) = (180 − 14) / 180 = 0.922.
Next we calculate the failure rate according to equation (1.3), using the average number of serviceable TEDs over the interval. For the interval 0-100 thousand km:

λ(0-100) = 2 / (179 · 100000) = 1.117·10⁻⁷ 1/km,

where 179 = (180 + 178)/2 is the average number of serviceable TEDs in the interval. Similarly, for the interval 100-200 thousand km:

λ(100-200) = 12 / (172 · 100000) = 6.977·10⁻⁷ 1/km.
Using equations (1.5 and 1.6), we determine the average time to failure and the average value of the failure rate parameter.

We systematize the results of the calculation and present them in the form of a table (Table 1.2.).

Table 1.2.

The results of the calculation of reliability indicators
Δl, thousand km | 0-100 | 100-200 | 200-300 | 300-400 | 400-500 | 500-600
ri              | 2     | 12      | 16      | 10      | 14      | 6
Σri             | 2     | 14      | 30      | 40      | 54      | 60
P(l)            | 0.989 | 0.922 | 0.833 | 0.778 | 0.700 | 0.667
Q(l)            | 0.011 | 0.078 | 0.167 | 0.222 | 0.300 | 0.333
a(l)·10⁻⁷, 1/km | 1.111 | 6.667 | 8.889 | 5.556 | 7.778 | 3.333
λ(l)·10⁻⁷, 1/km | 1.117 | 6.977 | 10.127 | 6.897 | 10.526 | 4.878
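The calculations of this example can be checked with a short script. It is a sketch that assumes the grouped-data estimators used in this section (P = (N0 − Σr)/N0, a = r/(N0·Δl), λ = r/(Nav·Δl), Wav = Σr/(N0·lΣ)); the variable names are illustrative.

```python
# Reproduce Table 1.2 from the TED failure data.
N0 = 180                       # serviceable motors at the start of the period
dl = 100_000                   # run interval width, km
r = [2, 12, 16, 10, 14, 6]     # failures in each 100-thousand-km interval

cum = 0
for i, ri in enumerate(r):
    before = N0 - cum          # serviceable motors at the start of the interval
    cum += ri
    after = N0 - cum           # serviceable motors at the end of the interval
    P = after / N0             # probability of failure-free operation, eq. (1.1)
    Q = 1 - P                  # probability of failure
    a = ri / (N0 * dl)         # failure frequency, 1/km
    lam = ri / (((before + after) / 2) * dl)  # failure rate over the interval, 1/km
    print(f"{i*100}-{(i+1)*100} thousand km: P={P:.3f} Q={Q:.3f} "
          f"a={a*1e7:.3f}e-7 lam={lam*1e7:.3f}e-7")

W_av = cum / (N0 * len(r) * dl)  # average failure-flow parameter, eq. (1.6)
print(f"W_av = {W_av:.3e} 1/km")
```

The printed rows match Table 1.2 to three decimal places.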

Fig. 1.5 shows the variation of the probability of failure-free operation of the TEDs with the run. Note that at the first point of the graph, i.e. at a run of 0, the probability of failure-free operation takes its maximum value of 1.

Fig. 1.5. Probability of failure-free operation as a function of operating time

Fig. 1.6 shows the variation of the probability of TED failure with the run. Note that at the first point of the graph, i.e. at a run of 0, the probability of failure takes its minimum value of 0.

Fig. 1.6. Probability of failure as a function of operating time

Fig. 1.7 shows the variation of the TED failure frequency with the run.

Fig. 1.7. Failure frequency as a function of operating time

Fig. 1.8 shows the dependence of the failure rate on the operating time.

Fig. 1.8. Failure rate as a function of operating time

2.1 Exponential law of distribution of random variables

The exponential law describes quite accurately the reliability of units subject to sudden failures of a random nature. Attempts to apply it to other kinds of failures, especially gradual ones caused by wear and by changes in the physicochemical properties of elements, have shown its limited applicability.

Initial data.
As a result of testing ten high-pressure fuel pumps, the following operating times to failure were obtained: 400, 440, 500, 600, 670, 700, 800, 1200, 1600 and 1800 hours. It is assumed that the operating time to failure of the fuel pumps obeys an exponential distribution law.

Required.
Estimate the failure rate, and calculate the probability of failure-free operation over the first 500 hours and the probability of failure in the interval between 800 and 900 hours of diesel operation.

First we determine the mean time to failure of the fuel pumps:

T1 = (400 + 440 + 500 + 600 + 670 + 700 + 800 + 1200 + 1600 + 1800) / 10 = 871 h.

Then we calculate the failure rate:

λ = 1 / T1 = 1 / 871 ≈ 1.15·10⁻³ 1/h.

The probability of failure-free operation of the fuel pumps over an operating time of 500 hours is:

P(500) = exp(−λ · 500) = exp(−500/871) ≈ 0.563.

The probability of failure between 800 and 900 hours of pump operation is:

Q(800, 900) = P(800) − P(900) = exp(−800/871) − exp(−900/871) ≈ 0.399 − 0.356 = 0.043.
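The exponential-law calculation can be reproduced with a few lines of code; the estimator λ = 1/T1 and the values below come from the problem data.

```python
import math

# Exponential-law reliability example for the fuel pumps.
t = [400, 440, 500, 600, 670, 700, 800, 1200, 1600, 1800]  # hours to failure
T1 = sum(t) / len(t)        # mean time to failure, hours
lam = 1 / T1                # failure rate, 1/hour

P500 = math.exp(-lam * 500)                              # no failure in first 500 h
Q_800_900 = math.exp(-lam * 800) - math.exp(-lam * 900)  # failure within (800, 900) h

print(f"T1 = {T1:.0f} h, lam = {lam:.2e} 1/h")
print(f"P(500) = {P500:.3f}, Q(800..900) = {Q_800_900:.3f}")
```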

2.2 Weibull-Gnedenko distribution law

The Weibull-Gnedenko distribution law is widely used for systems consisting of elements connected in series from the standpoint of reliability, for example the systems serving a diesel-generator set: lubrication, cooling, fuel supply, air supply, and so on.

Initial data.
The idle time of diesel locomotives in unscheduled repair due to faults of auxiliary equipment obeys the Weibull-Gnedenko distribution law with parameters b = 2 and a = 46.

Required.
It is necessary to determine the probability that a diesel locomotive leaves unscheduled repair within 24 hours of downtime, and the downtime within which operability is restored with a probability of 0.95.

Let's find the probability of restoring the locomotive's performance after it has been idle in the depot for a day according to the equation:

To determine the recovery time of the locomotive with a given value of confidence probability, we also use the expression:
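The parameterization of the Weibull-Gnedenko law varies between textbooks, so the sketch below is an assumption: it uses the shape-scale form F(t) = 1 − exp(−(t/a)^b) with the problem's b = 2 and a = 46, and the function names are illustrative.

```python
import math

b, a = 2.0, 46.0   # shape and scale; the form F(t) = 1 - exp(-(t/a)**b) is assumed

def F(t):
    """Probability that the repair is completed within t hours."""
    return 1 - math.exp(-((t / a) ** b))

def t_for(prob):
    """Downtime after which operability is restored with the given probability."""
    return a * (-math.log(1 - prob)) ** (1 / b)

print(f"F(24) = {F(24):.3f}")          # probability of leaving repair within a day
print(f"t(0.95) = {t_for(0.95):.1f} h")
```

With a different textbook parameterization (e.g. F(t) = 1 − exp(−t^b/a)) the numerical answers change, so the formula in use should be checked against the source of the problem.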

2.3 Rayleigh's distribution law

The Rayleigh distribution law is mainly used to analyze the operation of elements that have a pronounced effect of aging (electrical equipment elements, various kinds of seals, washers, gaskets made of rubber or synthetic materials).

Initial data.
It is known that the operating time to failure of contactors, with respect to coil insulation aging, can be described by the Rayleigh distribution with parameter S = 260 thousand km.

Required.
For an operating time of 120 thousand km, it is necessary to determine the probability of failure-free operation, the failure rate, and the mean time to the first failure of the electromagnetic contactor coil.
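A sketch of this calculation using the standard Rayleigh relations P(t) = exp(−t²/2S²), λ(t) = t/S² and T1 = S·√(π/2); the variable names are illustrative.

```python
import math

S = 260.0   # Rayleigh distribution parameter, thousand km
l = 120.0   # operating mileage, thousand km

P = math.exp(-l**2 / (2 * S**2))   # probability of failure-free operation at l
lam = l / S**2                     # failure rate at l, 1/(thousand km)
T1 = S * math.sqrt(math.pi / 2)    # mean mileage to first failure, thousand km

print(f"P(120) = {P:.3f}")
print(f"lam(120) = {lam:.2e} 1/thousand km")
print(f"T1 = {T1:.1f} thousand km")
```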

3.1 Basic connection of elements

A system of several independent elements, connected functionally so that the failure of any one of them causes failure of the whole system, is represented by a reliability block diagram in which the no-failure events of the elements are connected in series.

Initial data.
The non-redundant system consists of 5 elements. Their failure rates are, respectively, 0.00007, 0.00005, 0.00004, 0.00006 and 0.00004 h⁻¹.

Required.
It is necessary to determine the reliability indicators of the system: the failure rate, the mean time to failure, the probability of failure-free operation, and the failure frequency. Obtain the indicators P(l) and a(l) in the range from 0 to 1000 hours with a step of 100 hours.

We calculate the system failure rate and mean time to failure:

λc = Σ λi = 0.00007 + 0.00005 + 0.00004 + 0.00006 + 0.00004 = 0.00026 h⁻¹; T1 = 1/λc ≈ 3846 h.

The probability of failure-free operation and the failure frequency are obtained from the equations reduced to the form:

P(l) = exp(−λc·l); a(l) = λc·exp(−λc·l).
The calculation results for P(l) and a(l) in the interval from 0 to 1000 hours of operation are presented in Table 3.1.

Table 3.1.

The results of calculating the probability of failure-free operation and the failure frequency of the system in the time interval from 0 to 1000 hours
l, h | P(l)     | a(l), h⁻¹
0    | 1        | 0.00026
100  | 0.974335 | 0.000253
200  | 0.949329 | 0.000247
300  | 0.924964 | 0.00024
400  | 0.901225 | 0.000234
500  | 0.878095 | 0.000228
600  | 0.855559 | 0.000222
700  | 0.833601 | 0.000217
800  | 0.812207 | 0.000211
900  | 0.791362 | 0.000206
1000 | 0.771052 | 0.0002
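Table 3.1 can be generated directly from the series-system relations λc = Σλi, P(l) = exp(−λc·l) and a(l) = λc·exp(−λc·l); the script below is a sketch with illustrative variable names.

```python
import math

# Series (non-redundant) system of 5 elements: the system failure rate is
# the sum of the element failure rates.
lams = [0.00007, 0.00005, 0.00004, 0.00006, 0.00004]  # element failure rates, 1/h
lam_c = sum(lams)            # system failure rate, 1/h
T1 = 1 / lam_c               # mean time to failure, h

for l in range(0, 1001, 100):
    P = math.exp(-lam_c * l)          # probability of failure-free operation
    a = lam_c * math.exp(-lam_c * l)  # failure frequency, 1/h
    print(f"l={l:4d} h  P={P:.6f}  a={a:.6f}")

print(f"lam_c = {lam_c:.5f} 1/h, T1 = {T1:.0f} h")
```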

Plots of P(l) and a(l) over the interval up to the mean time to failure are shown in Fig. 3.1 and 3.2.

Fig. 3.1. Probability of failure-free operation of the system.

Fig. 3.2. Failure frequency of the system.

3.2 Redundant connection of elements

Initial data.
Fig. 3.3 and 3.4 show two block diagrams of element connection: general redundancy (Fig. 3.3) and element-by-element redundancy (Fig. 3.4). The probabilities of failure-free operation of the elements are, respectively, P1(l) = P′1(l) = 0.95; P2(l) = P′2(l) = 0.9; P3(l) = P′3(l) = 0.85.

Fig. 3.3. Diagram of a system with general redundancy.

Fig. 3.4. Diagram of a system with element-by-element redundancy.

The probability of failure-free operation of a block of three elements without redundancy is:

P = P1·P2·P3 = 0.95 · 0.9 · 0.85 ≈ 0.727.

The probability of failure-free operation of the same system with general redundancy (Fig. 3.3) is:

Pgen = 1 − (1 − 0.727)² ≈ 0.925.

The probabilities of failure-free operation of each of the three blocks with element-by-element redundancy (Fig. 3.4) are:

P1 = 1 − (1 − 0.95)² = 0.9975; P2 = 1 − (1 − 0.9)² = 0.99; P3 = 1 − (1 − 0.85)² = 0.9775.

The probability of failure-free operation of the system with element-by-element redundancy is:

Pel = 0.9975 · 0.99 · 0.9775 ≈ 0.965.
Thus, element-by-element redundancy gives a more significant increase in reliability (the probability of failure-free operation increased from 0.925 to 0.965, i.e. by 4%).
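The comparison of the two redundancy schemes can be checked with a short script using the standard parallel-connection formula 1 − (1 − P)² for a duplicated unit; the variable names are illustrative.

```python
# Compare general vs element-by-element redundancy for the three-element block.
P = [0.95, 0.90, 0.85]   # element reliabilities

P_block = P[0] * P[1] * P[2]        # block without redundancy
P_general = 1 - (1 - P_block) ** 2  # whole block duplicated (Fig. 3.3)

P_elementwise = 1.0                 # each element duplicated (Fig. 3.4)
for p in P:
    P_elementwise *= 1 - (1 - p) ** 2

print(f"no redundancy:       {P_block:.3f}")
print(f"general redundancy:  {P_general:.3f}")
print(f"element-by-element:  {P_elementwise:.3f}")
```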

Initial data.
Fig. 3.5 shows a system with a combined connection of elements. The probabilities of failure-free operation of the elements are: P1 = 0.8; P2 = 0.9; P3 = 0.95; P4 = 0.97.

Required.
It is necessary to determine the reliability of the system, and also the reliability of the same system under the condition that there are no redundant elements.

Fig. 3.5. Diagram of the system with combined connection of elements.

For the calculation, the main blocks of the original system must be identified; there are three of them (Fig. 3.6). We then calculate the reliability of each block separately and finally find the reliability of the entire system.

Fig. 3.6. Block diagram of the system.

The reliability of the system without redundancy is:

P = P1 · P2 · P3 · P4 = 0.8 · 0.9 · 0.95 · 0.97 ≈ 0.663.

Thus, the non-redundant system is about 28% less reliable than the redundant one.

Reliability and survivability of onboard computing systems (BCVS).

Reliability is the property of a product to perform the required functions, maintaining its performance within specified limits for the required period of time.

Survivability is the ability of a computing system to perform its basic functions despite damage received and failed hardware elements.

More stringent requirements are imposed on the reliability and survivability of onboard computers and onboard computing systems than on those of universal and personal computers. If an onboard computer fails, the system's operation is disrupted and the assigned tasks are not performed, which can lead to irreparable consequences, including human casualties.

Re-solving the problem after the onboard computer has been restored is often impossible. For example, if an anti-aircraft missile system fails, the defended object will be destroyed; even if the system is restored quickly, the destruction, like the lost lives, cannot be undone. A failure in avionics can lead to an aircraft crash or a spontaneous missile launch; in this case, too, restoring the operation of the BCVS will not correct the consequences of the error.

Ensuring the high reliability and survivability of the BCVS is complicated by the onboard operating conditions: large fluctuations of temperature and humidity, mechanical loads, and high dust levels. Strict limitations are also imposed on the dimensions and weight of the equipment. This applies primarily to aviation, but is also of great importance for BCVS in other areas.

Thus, the problem of the reliability and survivability of onboard computers has a number of features arising from the structure of onboard computing systems and the nature of the functions they perform.

Providing high reliability and survivability in a complex system can be very costly, complex and time-consuming; moreover, the difficulties in production and the problems arising during operation from the need to ensure and maintain the required level of reliability can be greater still.

For example, if the reliability of a missile system is reduced by 10%, to ensure the same degree of target destruction, an increase of at least 10% in the actual number of combat missiles will be required. These missiles require additional launch pads, test facilities, launch equipment, maintenance personnel and support equipment, which is costly and time consuming.

The more complex the structure of a computing system, the more difficult it is to ensure its reliability and survivability. It should be noted that most of the failures that occurred in launches of guided missiles and artificial satellites in the United States were not caused by the malfunction of some exotic device whose design pushed the state of the art. On the contrary, many failures were caused by the failure of functional and structural elements of previously approved designs. Sometimes elements were manufactured incorrectly; in other cases there were errors by programmers or maintenance personnel. No detail is too insignificant to be a possible cause of failure. High potential and practically achievable reliability is largely the result of deep and close attention to detail.

The problem of increasing reliability and fault tolerance is inherent not only to BCVS, but also to commercial equipment. For example, in a Google cluster, on average, 1 computer fails per day (that is, crashes occur on about 3% of computers per year). Of course, due to the redundancy of data and code, these failures are invisible to users, but for a programmer they are a big problem.

The case in which a computing system, or part of it, goes out of order and further operation is impossible without repair is called a failure.

Reliability theory distinguishes three characteristic types of failures, which are inherent in equipment and manifest themselves without any human influence.

1. Running-in failures. These occur during the early period of operation and in most cases are caused by deficiencies in production technology and defects in the manufacture of the computing system's elements. They can be eliminated by screening, burn-in, and technological testing of the finished product.

2. Wear-out, or gradual, failures. These occur because of wear of individual parts or drift of equipment parameters, and are characterized by a gradual change in the parameters of the product or its elements. At first they may appear as intermittent faults, but as wear increases, these faults turn into hard equipment failures. Such failures are a sign of an aging BCVS; they can be partially prevented by proper operation, good preventive maintenance, and timely replacement of worn-out elements.

3. Sudden, or catastrophic, failures. These cannot be eliminated by debugging the equipment, by proper maintenance, or by preventive work. Sudden failures occur randomly and cannot be predicted, but they obey certain probabilistic laws: over a sufficiently long period, the frequency of sudden failures becomes approximately constant. This happens in any device. Examples of such random failures are open and short circuits, which usually cause an output to be stuck permanently at 0 or 1. When random failures occur, the affected elements must be replaced; for this, the computing system must be maintainable and allow preventive work to be carried out quickly in the field.

Intermittent malfunctions, or faults, can be distinguished as a separate group. A fault is a short-term disruption of the normal operation of an onboard computer in which one or more of its elements produce a random result while performing one or several related operations. After a fault, the computing system can again function normally for a long time.

Faults can be caused by electromagnetic interference, mechanical impacts, and so on. Often a fault does not lead to failure of the complex but only changes the course of program execution through the incorrect execution of one or more instructions, which can have catastrophic consequences. The difference between faults and failures is that when the consequences of a fault are detected, it is not the equipment that must be restored but the information corrupted by the fault.

Speaking of faults, the so-called schrodinbugs should be mentioned. A schrodinbug is an error with which a computing system functions normally for a long time, but under certain conditions, for example non-standard operating parameters, a failure occurs. Analysis of the failure then reveals that the system software contains a fundamental error because of which it should never have functioned at all.

A schrodinbug can be formed by a complex combination of paired errors, where an error in one place is compensated by an error of the opposite effect in another. Under certain circumstances the balance of errors is destroyed, which paralyzes operation.

Thus, the BCVS is characterized by one more property that determines its reliability: error-free (fault-free) operation. Consequently, the dependability of the BCVS is a combination of failure-free operation, fault-free operation, survivability, and maintainability.

The reliability parameters are:

1. Failure rate, λ.

2. Mean time between failures (MTBF), T.

3. Probability of failure-free operation within a given time, P.

4. Probability of failure, Q.

Failure rate

Failure rate is the frequency with which failures occur. If the equipment consists of several elements, then its failure rate is equal to the sum of the failure rates of all elements whose failures lead to equipment failure.

The failure rate curve, depending on the operating time, is shown in the figure below.

At the beginning of operation (at time t = 0) a large number of elements are put into service. This population may initially have a high failure rate due to defective samples. As the defective elements fail one by one, the failure rate drops relatively quickly during the running-in period and becomes approximately constant by the start of normal operation (T_normal), when the defective elements have already failed and been replaced by serviceable ones.

The population of elements that has passed the running-in period has the lowest failure rate, which remains approximately constant until the elements begin to fail from wear (T_wear). From that point on, the failure rate starts to increase.

MTBF

MTBF is the ratio of the total operating hours to the total number of failures. During the period of normal operation, when the failure rate is approximately constant, the mean time between failures is the reciprocal of the failure rate:

T = 1 / λ.
Probability of failure-free operation.

The probability of failure-free operation is the expected fraction of devices that will function without failure during a given period of time:

P(t) = e^(−λt).
This formula is valid for all devices that have been run in but are not affected by wear. Therefore, the time t cannot exceed the period of normal operation of the devices.

A graph showing the probability of failure-free operation as a function of normal operation time is shown below:

Failure probability.

The probability of failure is the complement of the probability of failure-free operation: Q(t) = 1 − P(t).
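For a constant failure rate the two quantities always sum to one. A minimal sketch, with an illustrative λ and operating time:

```python
import math

lam = 0.00026   # failure rate, 1/h (illustrative value)
t = 500         # operating time within the normal-operation period, h

P = math.exp(-lam * t)  # probability of failure-free operation
Q = 1 - P               # probability of failure: the complement of P

print(f"P = {P:.4f}, Q = {Q:.4f}, P + Q = {P + Q:.4f}")
```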

Rated failure rate.

Equipment elements are designed to withstand certain nominal levels of voltage, current, temperature, vibration, humidity, and so on. When the equipment is subjected to these influences in operation, a certain specific failure rate is observed, called the nominal failure rate.

When the overall working load or some particular loads or environmental hazards increase beyond the nominal levels, the failure rate rises quite sharply compared to its nominal value. Conversely, the failure rate decreases when the load drops below the nominal level.

For example, if an element is rated to operate at a nominal temperature of 60 degrees, lowering the temperature with a forced cooling system can reduce the failure rate. However, if the lower temperature entails too large an increase in the number of elements and the weight of the apparatus, it may be more advantageous to select elements rated for a higher operating temperature and use them below their nominal temperature. In this case the equipment can be cheaper and lighter (which matters on board an aircraft) than with a forced cooling system.

Methods for determining the reliability of BTsVS.

When new products are designed, the failure rate cannot be determined by mechanical, electrical, chemical, or other measurements. Failure rates are determined from statistical data obtained in reliability testing of the product in question or of similar products.

The probability of failure-free operation during a test time t is expressed by the formula:

P(t) = N(t) / N0,

where N0 is the number of items at the start of the test and N(t) is the number still operating at time t. The failure rate is determined by the formula:

λ = r / (N · Δt),

where r is the number of failures observed during the interval Δt among the N items under test.
When measuring the failure rate, it is necessary to maintain a constant number of elements participating in the test by replacing the failed elements with new ones.
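A replacement test of this kind can be sketched as follows; the data are made-up and the estimator λ = r/(N·Δt) from the text is assumed.

```python
# Estimating the failure rate from a replacement test: the number of items on
# test is held constant by replacing each failed item with a new one.
N = 100                    # items kept on test throughout
failures = [3, 2, 4, 1]    # failures observed in four equal intervals
dt = 250.0                 # interval length, h

total_r = sum(failures)
total_hours = N * dt * len(failures)  # total item-hours accumulated
lam = total_r / total_hours           # overall failure-rate estimate, 1/h
T = 1 / lam                           # MTBF estimate, h

print(f"lam = {lam:.2e} 1/h, MTBF = {T:.0f} h")
```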

Thus, in order to obtain data on the quantitative characteristics of the reliability of the equipment, it is necessary to make a special sample of the equipment for reliability testing. Reliability tests should be carried out under conditions corresponding to the actual operating conditions of the equipment in terms of external influences, the frequency of switching on and changes in power parameters.

