Discrete Event Simulation Tool, Example 1, Single Unit Failure/Repair

Discrete event simulation is a powerful technique that can be used to solve more complex system reliability modeling problems. This article introduces some of the capabilities of the Discrete Event Simulation tool in the Reliability Analytics Toolkit.

The Discrete Event Simulation tool can be used for:
1. Estimating system mean time between critical failure (MTBCF) for a system consisting of units with different failure and repair scenarios.
2. Estimating system operational availability (Ao).
3. Providing graphical visualizations of the overall failure and repair process for individual units, as well as a system of units operating together.
4. Estimating spare part requirements and the impact of different policies, such as local versus remote spare parts, on Ao and MTBCF.
5. Other custom user studies (by exporting the simulation results to Excel).

Example 1 is purposefully very simple in order to allow the reader to easily follow the steps and confirm the output solution to the maximum extent possible.

Example 1: Simulate one-year usage for a single unit that fails every 1,000 hours and is restored after exactly 100 hours of downtime.

Solution 1:

Inputs: Edit the input in box #1 (note, use one of the existing example lines shown when the web page first loads) so that only a single line, as highlighted in yellow below, remains. The interpretation of the highlighted line is that “unit1” (a unique name in field 1) fails exactly every 1,000 hours (field 2) and is restored exactly 100 hours after failure (field 3). An optional description, “Power Supply” (field 4), and an optional part number, “123” (field 5), are also entered. Input is expected as a copy/paste from Microsoft Excel, which produces “hidden tabs” between input fields. If Excel is not used, then copy and paste the example default lines present when the page loads.
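As an illustration of the five-field layout described above, the following minimal Python sketch splits one tab-separated line into named fields. It is not the tool's own parser: the names UnitSpec and parse_unit_line are hypothetical, and fields 2 and 3 are assumed to be plain numbers, as in this example.

```python
from typing import NamedTuple, Optional


class UnitSpec(NamedTuple):
    """One line of box #1 (hypothetical container, not the tool's own type)."""
    name: str
    time_to_failure: float
    time_to_restore: float
    description: Optional[str] = None
    part_number: Optional[str] = None


def parse_unit_line(line: str) -> UnitSpec:
    """Split a tab-separated line into the five fields described in the text."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) < 3:
        raise ValueError("expected at least name, failure time, and restore time")
    # Assumption: fields 2 and 3 are plain numbers, possibly with thousands commas.
    ttf = float(fields[1].replace(",", ""))
    ttr = float(fields[2].replace(",", ""))
    description = fields[3] if len(fields) > 3 and fields[3] else None
    part_number = fields[4] if len(fields) > 4 and fields[4] else None
    return UnitSpec(fields[0], ttf, ttr, description, part_number)


spec = parse_unit_line("unit1\t1000\t100\tPower Supply\t123")
print(spec)
```

Splitting on the tab character mirrors the “hidden tabs” that a copy/paste from Excel places between fields.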

Since the problem calls for simulating for one year, we leave the default value of 8,760 hours (one year) for input #4. Input #2 can be used to define a “watch list” to monitor for various combinations of failure events. For this example, we define the failure of unit1 as a critical failure.  Input #3 is optional and can be used to periodically restore items to “as good as new.” Only items previously defined in box #1 above are meaningful input. For input #2, after each simulated failure event, each user defined set of critical items is compared to the current list of items in a failed state. If the set is a subset of the current set of failed items, then a “critical failure” is tallied. If the number of units or items being considered is large, the process of defining combinations of units that result in a critical failure can be tedious; however, first using this tool (option 1d) to enumerate all possible states for a given list of units and then selecting the critical combinations may be helpful.
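The watch list comparison described above amounts to a subset test. The short Python sketch below shows the idea; the function name is_critical and the use of Python sets are assumptions made for illustration, not the tool's actual code.

```python
def is_critical(watch_list, failed_items):
    """Return True if any watch-list set is a subset of the failed items."""
    failed = set(failed_items)
    return any(set(critical) <= failed for critical in watch_list)


# Example 1: unit1 alone is defined as critical.
print(is_critical([{"unit1"}], {"unit1"}))  # True
print(is_critical([{"unit1"}], set()))      # False

# A redundant pair: critical only when both members are down together.
print(is_critical([{"sn1", "sn2"}], {"sn1"}))         # False
print(is_critical([{"sn1", "sn2"}], {"sn1", "sn2"}))  # True
```

In this sketch the check would be run after each simulated failure event, using the set of items currently in a failed state.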

Outputs: The results of the simulation are first presented as three types of graphs, as shown below. The first, an “Up/Down” graph, shows when unit1 is “up” and when it is “down” for maintenance. For this example, it is up for exactly 1,000 hours, then fails, and is then restored after exactly 100 hours. There will be one up/down graph for each unique item name entered into box #1 on the input page. These names are shown in the lower left corner of each up/down graph.

The second graph shows the cumulative number of failures over the entire simulation time for a “system.” In this case the “system” is only unit1; in general, however, a “system” is considered to be the full set of unique item names entered into box #1 on the input screen.

The third graph shows the quantity of units that are either up or down for the system. Since the system consists of only a single unit, this graph just cycles between one and zero: one during each 1,000-hour up period and zero during each 100-hour repair. The fourth graph shows the cumulative critical failures, which is identical to the second graph because there is no redundancy/fault tolerance. The final graph is the system mission up/down graph. This is identical to the first graph because there is only a single unit with no redundancy, so all failures are critical.

The next section of the results page shows the simulation results in the form of a summary table, as shown below. The first column of the table shows the time at which each event occurs. For this example, the first failure occurs at 1,000 hours and repair is complete 100 hours later at 1,100 hours. The second column shows the event description, which is either a failure or restoration of unit1. The third column lists the part number (PN), which is an optional field that is primarily used for other custom studies when exporting results to Excel, for example, to roll up failures and repairs by part number and estimate associated depot support and maintenance costs. The fourth column is a failure count, which in this case totals eight failures over 8,760 hours. The fifth column is a count of repair actions, which totals seven repairs over 8,760 hours. The repair of the final failure was not completed because the failure occurred at 8,700 hours and the repair was scheduled to finish at 8,800 hours; however, the simulation ended at 8,760 hours. The sixth column keeps track of restoration time, which totals 700 hours for this example. The seventh column shows the quantity of units in an up state after each simulation event. Since only one unit is included in the simulation, the quantity of units up just cycles between one and zero. The eighth column shows the time spent in each state, which for this simple example is exactly 1,000 hours up followed by 100 hours down. The ninth column shows the units that are in a failed state after each simulated event (note, SN is serial number, which is just a unique identifier for each item entered in input #1). In this case, we only have a single unit, so this column just shows unit1 going into and out of a failed state. Note that unit1 spends the final 60 hours in a failed state, and is shown in a failed state at the simulation end time of 8,760 hours.
The tool allows the user to define combinations of events that will result in a “critical” failure. Since we defined unit1 as critical in input #2, columns 10 and 11 keep track of critical failures (CF) and associated downtime. All data can be exported to Excel for further offline analysis by selecting Excel output for item #7 on the input page.
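The event counts quoted above can be checked by hand with a few lines of plain Python. The sketch below is a simplified stand-in for the tool's SimPy model, not a copy of it: the function name simulate_single_unit is hypothetical, and it assumes a repair that finishes exactly at the end time counts as completed.

```python
def simulate_single_unit(ttf, ttr, end_time):
    """Deterministic fail/repair cycle for one unit.

    Returns (events, failures, repairs, restoration_hours, uptime_hours).
    """
    events = []
    t = 0.0
    failures = repairs = 0
    restoration = uptime = 0.0
    while True:
        fail_at = t + ttf
        if fail_at > end_time:
            uptime += end_time - t  # unit is still up when the run ends
            break
        uptime += ttf
        failures += 1
        events.append((fail_at, "failure"))
        restore_at = fail_at + ttr
        if restore_at > end_time:
            break  # repair still in progress when the run ends
        repairs += 1
        restoration += ttr
        events.append((restore_at, "restoration"))
        t = restore_at
    return events, failures, repairs, restoration, uptime


events, failures, repairs, restoration, uptime = simulate_single_unit(1000, 100, 8760)
print(failures, repairs, restoration, uptime)  # 8 7 700.0 8000.0
print(events[0], events[1], events[-1])
# (1000.0, 'failure') (1100.0, 'restoration') (8700.0, 'failure')
```

The last event is the failure at 8,700 hours, which leaves the unit down for the final 60 hours of the 8,760-hour run.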

Just below the list of events (shown above) are simulation summary calculations, as shown below. The total restoration time is the sum of the times shown in column six above, 700 hours.  Eight failures occurred but only seven repair actions were completed because the simulation ended before the last repair was complete. The simulated system MTBCF is computed as 1,143 operating hours, which is approximately what is expected based on the input. The simulated MTBCF is not exactly 1,000 hours because of the point where the simulation was stopped.  

If the simulation is rerun for 8,800 hours instead of 8,760 hours, then the simulated MTBCF is exactly 1,000 operating hours, as shown below. Additional metrics, such as availability and MTBCF at a user-selected confidence level (chi-square one-sided lower limit), are also provided.
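The two MTBCF values can be reproduced with simple arithmetic. The article does not state the exact formula the tool uses; the sketch below assumes, as an inference from the reported numbers, that accumulated operating hours are divided by the number of completed failure/repair cycles (seven at 8,760 hours, eight at 8,800 hours). The function name mtbcf is hypothetical.

```python
def mtbcf(operating_hours, completed_cycles):
    """Operating hours per completed failure/repair cycle (assumed divisor)."""
    return operating_hours / completed_cycles


# 8,760-hour run: 8,000 operating hours, eighth repair still in progress.
print(round(mtbcf(8000, 7)))  # 1143

# 8,800-hour run: 8,000 operating hours, all eight repairs complete.
print(round(mtbcf(8000, 8)))  # 1000
```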

The next part of the simulation table summarizes the state of the system over the course of the simulation.  For this simple example, the system, or unit1, can either be up or down, so the table indicates that “1 units up” for 8,000 hours, or 91.32 percent of the simulated time. Note that the most basic definition of system availability is up time divided by total time, which is 8,000 hours/8,760 hours = 0.9132, or 91.32%.  On eight occasions the system transitioned from having one unit up to having zero units up, as shown below in the “count of failure transitions” column. The top half of the picture below shows cumulative statistics while the bottom half shows times associated with exactly some quantity of units operating.
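The state summary above can be rebuilt from the sequence of state changes. The following Python sketch tallies hours per state and counts failure transitions for Example 1; the function name tally_states and the (time, units_up) tuple layout are assumptions for illustration, not the tool's internal format.

```python
from collections import defaultdict


def tally_states(transitions, end_time):
    """transitions: time-ordered (time, units_up) pairs, starting at time 0."""
    hours = defaultdict(float)
    failure_transitions = 0
    for i, (t, up) in enumerate(transitions):
        t_next = transitions[i + 1][0] if i + 1 < len(transitions) else end_time
        hours[up] += t_next - t
        if i > 0 and up < transitions[i - 1][1]:
            failure_transitions += 1  # quantity of units up decreased
    return dict(hours), failure_transitions


# Build the Example 1 timeline: 1,000 hours up, 100 hours down, for 8,760 hours.
END = 8760
transitions = [(0, 1)]
t = 0
while t + 1000 <= END:
    t += 1000
    transitions.append((t, 0))      # failure
    if t + 100 <= END:
        t += 100
        transitions.append((t, 1))  # restoration
    else:
        break                       # final repair not finished

hours, failure_transitions = tally_states(transitions, END)
availability = hours[1] / END
print(hours[1], hours[0], failure_transitions)  # 8000.0 760.0 8
print(f"{availability:.4f}")                    # 0.9132
```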

The final portion of the output table summarizes the user-entered inputs. In this case, it shows just unit1, with a 1,000 hour MTBF and a 100 hour MTTR. In the right-hand columns it also shows the number of simulated failures, downtime, uptime, Ao, failure rate and MTBF associated with each individual unit. The MTBF shown is the one-sided lower confidence limit on MTBF. The calculation is equivalent to entering the uptime and number of failures into this tool and calculating the one-sided lower confidence bound with equation 2.

The seed is used by the random number generator to select different sequences of random numbers, which are used when probability distributions are defined in input box #1. For this simple example, unit1 fails at exactly 1,000 hour intervals, so there is no failure distribution and no random numbers are generated for simulated failure times. Therefore, the seed has no effect in this example. However, if we changed the input to specify that unit1 fails in accordance with the exponential failure distribution, with a mean time between failure (MTBF) of 1,000 hours, then the seed would be used to select a set of random numbers for generating simulated failure times. If the same seed is used, then the same results will be obtained for a given set of inputs (input box #1). If the seed is left blank, then the Google App Engine server generates the seed and the simulation results will be somewhat different for each subsequent simulation trial, although the conclusions drawn from simulations using different seeds should be similar. Therefore, if there is a desire to exactly duplicate the simulation results at a later date, a user-defined seed should be entered.
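The effect of the seed can be demonstrated with Python's standard random module. This is a sketch of the general behavior, not the tool's code: the function name sample_failure_times is hypothetical, and how the tool actually draws its samples is not described in this article.

```python
import random


def sample_failure_times(mtbf, n, seed=None):
    """Draw n exponential times to failure with the given mean (MTBF)."""
    rng = random.Random(seed)  # seed=None lets Python choose its own seed
    return [rng.expovariate(1.0 / mtbf) for _ in range(n)]


a = sample_failure_times(1000, 5, seed=42)
b = sample_failure_times(1000, 5, seed=42)
c = sample_failure_times(1000, 5, seed=7)

print(a == b)  # True: same seed, same sequence
print(a == c)  # False: a different seed gives a different sequence
```

Using a dedicated random.Random instance keeps each run's sequence independent of any other random draws in the program.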

Another available option is to select the “Export to Excel” option (input #7), which results in the output shown below. 

The underlying discrete event simulation engine is SimPy (Simulation in Python), which runs on the Google App Engine. See the references listed below for additional details on SimPy.

The above was a very simple example for the purpose of allowing the reader to easily follow along. However, the power of the discrete event simulation technique is that it allows estimates of system MTBF, MTBCF and Ao for far more complex systems, which may include fault tolerance and individual units that are expected to fail in accordance with different failure distributions, such as the exponential distribution for electronics and a Weibull failure distribution with an increasing hazard rate (failure rate) for mechanical items that are subject to wearout over time. An example of defining a more complex input scenario for input box #1 is shown below.
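To make the difference between the two distributions concrete, the sketch below evaluates the Weibull hazard rate and draws a few sample failure times with Python's standard random module. The scale and shape values are illustrative assumptions, and the function name weibull_hazard is hypothetical; none of this reflects the tool's input syntax.

```python
import random


def weibull_hazard(t, scale, shape):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Shape 1 reduces to the exponential distribution: constant hazard rate.
print(weibull_hazard(500, 1000, 1), weibull_hazard(2000, 1000, 1))  # 0.001 0.001

# Shape 2 (wearout): the hazard rate grows with operating time.
print(weibull_hazard(500, 1000, 2), weibull_hazard(2000, 1000, 2))  # 0.001 0.004

# Sample times to failure: weibullvariate(alpha, beta) takes scale, then shape.
rng = random.Random(1)
electronic = [rng.expovariate(1 / 1000) for _ in range(3)]
mechanical = [rng.weibullvariate(1000, 2) for _ in range(3)]
print(electronic, mechanical)
```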

The picture below shows an example of defining a critical failure “watch list” based on the example reliability block diagram model showing the quantity of units that must be operational for mission success. For example, only one of the two flow regulators, sn1 or sn2, is required, so the first line of the input creates the watch list set “sn1 and sn2”. If these two items are in a failed state at the same time during the simulation, a critical failure will be tallied.

The State enumeration and reliability tool may be helpful in defining the above watch list for more complex redundant configurations. For example, for the redundant power supplies (sn5, sn6, sn7), where at least 2 of 3 are required to operate for success, entering the inputs shown below in the first picture results in the output shown in the second picture, easily identifying the three sets needed for the watch list.
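For a simple k-of-n group, the watch-list sets can also be listed with a few lines of Python. This sketch is an alternative illustration, not the State enumeration tool itself; the function name minimal_critical_sets is hypothetical. It relies on the subset rule described earlier: only the smallest failing combinations are needed, since any larger set of failures contains one of them.

```python
from itertools import combinations


def minimal_critical_sets(units, required):
    """Smallest failure combinations that leave fewer than `required` units up."""
    n_failed = len(units) - required + 1
    return [set(combo) for combo in combinations(units, n_failed)]


# Redundant power supplies: at least 2 of 3 must operate.
power = minimal_critical_sets(["sn5", "sn6", "sn7"], 2)
print(power)  # three sets: sn5+sn6, sn5+sn7, sn6+sn7

# Flow regulators: only 1 of 2 is required.
flow = minimal_critical_sets(["sn1", "sn2"], 1)
print(flow)   # one set: sn1+sn2
```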


References:

  1. SimPy Home Page
  2. Matloff, Norm, University of California at Davis, Dept. of Computer Science, Introduction to Discrete-Event Simulation and the SimPy Language
  3. Matloff, Norm, University of California at Davis, Dept. of Computer Science, A Discrete-Event Simulation Course Based on the SimPy Language
  4. Python pseudo-random number generator