Why UQ is precious and DS is practical for bungee jumping
The usefulness of uncertainty quantification (UQ) with deterministic sampling (DS) may be illustrated with a bungee jumping enterprise. One critical aspect for a successful business and the well-being of customers is to find a high-quality rope. Assume two different companies offer ropes with the following tensile strengths (TS, in newtons),
A: TS=5000 N,
B: TS=5300 N.
Given this information, most people would without hesitation claim rope B to be the stronger. As will be explained though, rope A might outperform rope B. Quality is here implicitly assumed to be communicated through the perceived number of significant digits of the TS. That is a habit rather than a reliable rule. Stating that rope A is durable up to 5000 N may indicate failure in the range 4950 N to 5050 N, but it could equally well imply a range of 4500 N to 5500 N, or 4995 N to 5005 N. It all depends on how we interpret the significance of the zeros. The implicitly communicated information is clearly ambiguous. That means our uncertainty must be explicitly given as a calculated number, obtained by statistical analysis and/or a credible UQ method. Otherwise we might claim it is possible to find,
A: TS=5500 N,
B: TS=5250 N,
as the seller of A might do to elevate his product. Such con artistry is made possible by leaving UQ aside.
Distinguishable or indistinguishable strength?
Ropes A and B may only be claimed to differ if a reliable uncertainty figure is given. If
A: TS=5000 +/-100 N,
B: TS=5300 +/-100 N,
the intersection of the uncertainty intervals is empty so we are confident B is better than A. However, if
A: TS=5000 +/-400 N,
B: TS=5300 +/-200 N,
A may be stronger than B, or vice versa. If a particular rope breaks at 5200 N we cannot tell if it is of type A or B, as they have indistinguishable TS.
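The two cases above can be checked mechanically. The following is a minimal sketch, assuming the stated uncertainties are symmetric half-widths of closed intervals; the function name intervals_overlap is hypothetical and introduced only for illustration.

```python
def intervals_overlap(ts_a, u_a, ts_b, u_b):
    """Return True if [ts_a - u_a, ts_a + u_a] and [ts_b - u_b, ts_b + u_b] intersect."""
    lower = max(ts_a - u_a, ts_b - u_b)  # largest lower bound
    upper = min(ts_a + u_a, ts_b + u_b)  # smallest upper bound
    return lower <= upper


# First case: +/-100 N on both ropes, the intervals are disjoint.
print(intervals_overlap(5000, 100, 5300, 100))  # False
# Second case: +/-400 N and +/-200 N, the intervals share 5100 N to 5400 N.
print(intervals_overlap(5000, 400, 5300, 200))  # True
```

An empty intersection is what allows the ropes to be called different; a non-empty one means a break at, say, 5200 N is compatible with both.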
Buying ropes — statistical sampling
For the bungee jump company, an obvious method to gain some certainty about the product spread is to test samples of ropes. Ignorant physical sampling of this kind underpins mathematical statistics and is mainly motivated by the lack of any causal information. It is a typical perspective in situations where little or no insight can be gained about the physical mechanism of how ropes respond to stress.
Ronald Fisher studied how to grow crops around 100 years ago and is one of the giants of modern statistics. He distinguished any sample from the population it had been drawn from. The expected error and variation of estimation were quantified as bias and sampling variance, respectively. This view led him to propose the method of hypothesis testing, but also to state that statistics is about plausibility and never certainty of any kind. To evaluate the tensile strengths of ropes A and B, Fisher would probably suggest two approaches, depending on interest:
1. What is the strength of each rope? How accurately must the mean and spread be known?
2. Which rope is strongest? How certain do you need to be?
In case 1, the mean and the spread around the mean will be estimated from the sample. The requested accuracy sets the minimal sample size, as that determines the expected (no guarantee!) error of the analysis. In case 2, a hypothesis A:TS>B:TS is formulated, with the hope that it will be rejected. The requested certainty of correct rejection determines the sample size: for an exceedingly small sample the risk of incorrect rejection is always large. It is not allowed to successively increase the sample size and repeat the evaluation until we reject the hypothesis (although this is sometimes practiced). That would not correspond to the original design of Fisher’s experiment. The repetition must be explicitly accounted for, otherwise we will with almost complete certainty falsify the hypothesis even if it is true!
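How a requested accuracy translates into a minimal sample size in case 1 can be sketched as follows. This is only an illustration under assumptions not made in the text: the rope-to-rope spread sigma is taken as roughly known in advance (the 200 N used below is a made-up figure), and accuracy is expressed as the standard error of the sample mean, sigma/sqrt(n). The function name min_sample_size is hypothetical.

```python
import math


def min_sample_size(sigma, target_se):
    """Smallest n for which sigma / sqrt(n) does not exceed target_se."""
    return math.ceil((sigma / target_se) ** 2)


# Assumed spread of 200 N between ropes (illustrative only).
print(min_sample_size(200, 50))  # 16 ropes for a standard error of 50 N
print(min_sample_size(200, 20))  # 100 ropes for a standard error of 20 N
```

Improving the accuracy by a factor of 2.5 costs more than six times as many ropes, and the result is still only an expected error, not a guarantee.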
Samuel Wilks was a statistician contemporary with R. Fisher who extended this reasoning into a more industrial context by addressing manufacturing quality. At this time, what made products good or bad was probably not much better known to Wilks than what made crops grow was to Fisher. Their mutual lack of knowledge paved the way for them to apply similar statistical approaches based on willful ignorance of non-statistical information. Wilks’ method is currently practiced in nuclear power calculations and is perhaps the simplest conservative estimate of uncertainty bounds one can ever imagine, but it only makes sense in the absence of causal knowledge. In its simplest form, a one-sided bound covering 95% of the population is, with 95% confidence, given by the most extreme value of 59 random outcomes. Wilks would here suggest that the weakest of 59 sampled ropes provides a lower bound on the tensile strength of 95% of all(!) possible ropes supplied by each manufacturer. Inference about all possible outcomes, i.e. the population, is here made from a small finite random sample, similar to Fisher’s approach. The penalty of estimation is however large, as the relative uncertainty of Wilks’ estimate may exceed 100%, with a bias of the same order enforced by conservatism. This cost of poor estimation must be paid in the absence of causal information and with knowledge of just a small sample. Given these constraints, Wilks’ method is nevertheless ingenious in its simplicity and consistency.
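The figure of 59 follows from a short calculation. If each rope independently falls above the 5th percentile with probability 0.95, the weakest of n ropes lies below that percentile with probability 1 - 0.95^n, and n = 59 is the smallest sample for which this reaches 95%. A minimal sketch for the first-order, one-sided case only; the function name wilks_sample_size is hypothetical.

```python
def wilks_sample_size(coverage=0.95, confidence=0.95):
    """Smallest n such that the extreme of n random outcomes bounds
    the given fraction of the population with the given confidence."""
    n = 1
    # 1 - coverage**n is the probability that at least one outcome
    # falls in the excluded tail, so that the extreme is a valid bound.
    while 1.0 - coverage ** n < confidence:
        n += 1
    return n


print(wilks_sample_size())            # 59
print(wilks_sample_size(0.95, 0.99))  # 90
```

Nothing about the ropes enters the calculation, which is what makes the method both general and expensive.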
Selling ropes — optimal design
Now change focus from the buyer to the seller, and specifically its research and development department. Their typical setup is to propose a fairly large number of rope variations, like versions A1,A2,…,A10, based on how the production might be varied. A common development procedure is to predict the rope performance of each version by formulating and running a corresponding mathematical model. The best version may then be chosen based on the model results. Relying upon precise numbers, almost certainly only one version will be selected. If the modeling error happens to exceed the difference from the second-best version, there is a risk of not choosing the best design. Awareness of the problem of imperfect modeling suggests that several versions may qualify for final experimental testing. Clearly, a relevant quantified uncertainty is required to avoid incorrect rejection. Thus,
If we believe in modeling as a general method for selection in product development, we are obliged to evaluate its uncertainty properly, or at least conservatively.
The modeling uncertainty is in fact often underestimated due to the prevailing neglect of parameter dependencies, likely inherited from statistics where, on the contrary, independence is often an appropriate assumption. The principal objection here is that model results are by definition always correlated, which implies that uncertain parameters of optimal models must be correlated!
Quantifying modeling uncertainty
Determining the rope tensile strength variations from a common manufacturing perspective of today is fundamentally different from Wilks’ setup. The prior knowledge of any industrial process normally relates to model equations and all possible outcomes, i.e. the population of ropes. The task is to propagate population statistics, either from calibration data through model equations to prediction of tensile strength, or in reverse to improve predictability by finding the most appropriate model equations (identification). That is very different from the inference of population statistics from random samples addressed by Fisher, Wilks and their successors in statistics, a fact which currently is poorly understood and appreciated. By not utilizing claimed knowledge of the population and related causal information of the model equations, UQ sampling methods become inefficient and unreliable. In addition, our credibility requires a critical judgment of our prior knowledge and a humble acceptance that it is incomplete. That causes our prediction of rope performance to be ambiguous, as we do not know exactly all that is needed. Instead of bias and sampling variance, error and ambiguity result from formulating UQ in population statistics. The difference between sampling variance and ambiguity is paramount, since the former refers to smallness of samples while the latter relates to incomplete information. This discussion clarifies but does not solve the UQ problem for the manufacturer predicting the bungee jump rope tensile strength. So what methods are available?
Random sampling may be a common approach to propagate the uncertainty of the governing model equations of the rope to predict its tensile strength, like TS=5000 +/-200 N, but it violates virtually the whole context outlined above. Generating a finite sample from population statistics rather than the reverse, we are running our statistical inference ‘backwards’ compared to Fisher and Wilks. The intermediate random sample will invite sampling variability, exactly the source of error which was Fisher’s and Wilks’ main concern! Random sampling will add errors beyond our control. Even more troublesome is that we usually only run one sample and have little knowledge of how large these errors just happened to be. Another aspect is that a random generator requires complete statistical knowledge, which is extremely rare. Filling the gaps by e.g. assuming independence of various sources of uncertainty (very common) is not at all plausible, since a correlation of exactly zero has zero probability. Guesses are then propagated to predictions and it is difficult to understand why that is better than guessing the result directly. The most frequently debated problem of random sampling is its low computational efficiency: over-sized samples are required to suppress sampling variability. For a complex problem like weather forecasting there seems to be virtually no limit to how many computers are needed in order to reduce an error we introduced ourselves…
Deterministic sampling is developed from an entirely different and novel perspective. It fully respects the difference in the contexts of scientific modeling and statistics, as well as between samples and population, maximizes the utilization of prior information and accepts ambiguity. Samples are used to represent, i.e. describe, the claimed population statistics, similar to how Fourier series encode and describe images and signals in hierarchies of information. Population statistics may conveniently be organized in statistical moments, and how these are evaluated is well known. How they are estimated from finite samples is however of little interest, as all calculations refer to populations of outcomes. As finite random samples are entirely avoided throughout, there is no sampling variability and all results are completely reproducible. That does not imply there is no error, as approximations might still be needed. The ambiguity due to lack of information is probed by utilizing several ensembles representing the same known information, but different alternatives of the unknown. This approach is different from, but resembles, how Wilks and Fisher would illustrate sampling variance.
The uncertainty of modeled ropes
As a trivial illustration of the difference between random and deterministic sampling for quantifying the uncertainty of modeled rope A, assume its tensile strength is described by the model R=5000(1+X), where X is an uncertain component with zero mean and a known standard deviation of 0.1. To sample X randomly (rnd), the full distribution is needed, so we guess that it is normally distributed. Assume typical affordable sample sizes are 10 (‘rnd10’) and 100 (‘rnd100’). To sample X deterministically (det), it is possible to represent the standard deviation with just two(!) sample values, -0.1 and +0.1. Repeated evaluation of R for the sample, followed by estimation/evaluation of the mean and standard deviation (std), may (rnd) / will (det) result in:
A-rnd10: mean TS = (4869, 5357), std TS = (406, 656)
A-rnd100: mean TS = (4996, 5071), std TS = (521, 453)
A-det: TS = 5000 +/-500 (always)
For this model, the A-det result is absolutely correct, while A-rnd has a finite error. Without repetition it is generally impossible to understand how wrong A-rnd might be!
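The comparison for model A can be reproduced with a few lines of code. The sketch below uses only the Python standard library; the seeds are arbitrary, and the normal distribution is the same guess as in the text. The function names are hypothetical, and the random results will not match the numbers listed above, since they depend on the generator and seed.

```python
import random
import statistics


def rope_a(x):
    """Model A from the text: R = 5000(1 + X)."""
    return 5000.0 * (1.0 + x)


def det_result():
    """Two-point deterministic ensemble with mean 0 and std 0.1."""
    ts = [rope_a(x) for x in (-0.1, 0.1)]
    # Population statistics: the ensemble represents the population itself.
    return statistics.mean(ts), statistics.pstdev(ts)


def rnd_result(n, seed):
    """Random sample of size n, assuming (a guess!) that X is normal."""
    rng = random.Random(seed)  # arbitrary seed, only for repeatability
    ts = [rope_a(rng.gauss(0.0, 0.1)) for _ in range(n)]
    # Sample statistics: estimates of the population mean and std.
    return statistics.mean(ts), statistics.stdev(ts)


mean, std = det_result()
print(f"A-det: mean TS = {mean:.0f}, std TS = {std:.0f}")  # 5000 and 500
for n in (10, 100):
    for seed in (1, 2):  # two repetitions per sample size
        mean, std = rnd_result(n, seed)
        print(f"A-rnd{n} (seed {seed}): mean TS = {mean:.0f}, std TS = {std:.0f}")
```

Changing the seed changes every A-rnd line, while the A-det line stays the same on every run.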
Further, let’s assume rope model B is slightly more complex, R=5000(1+X)^2. Similarly to A,
B-rnd10: mean TS = (4771, 5817), std TS = (765, 1407)
B-rnd100: mean TS = (5046, 5184), std TS = (1047, 918)
The true result is here ambiguous, since knowing the std of X is not sufficient for determining the std of R. To exemplify this ambiguity explicitly, the deterministic samples sqrt(3/2)[-0.1, 0, 0.1] and sqrt(2/5)[-0.2, -0.1, 0.1, 0.2] might be utilized alongside the two-point sample [-0.1, 0.1]. All three samples have zero mean and a standard deviation of 0.1, and give three different, equally valid results,
B-det1: mean TS = 5050.0, std TS = 1000.0
B-det2: mean TS = 5050.0, std TS = 1000.6
B-det3: mean TS = 5050.0, std TS = 1000.4
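The three B-det results can be recomputed directly from the ensembles given in the text. A minimal sketch using the Python standard library; the helper name evaluate is hypothetical. Population formulas (pstdev) are used because each ensemble stands for the population rather than a sample drawn from it.

```python
import math
import statistics


def rope_b(x):
    """Model B from the text: R = 5000(1 + X)^2."""
    return 5000.0 * (1.0 + x) ** 2


# Three ensembles, each with mean 0 and standard deviation 0.1.
ensembles = {
    "B-det1": [-0.1, 0.1],
    "B-det2": [math.sqrt(3 / 2) * v for v in (-0.1, 0.0, 0.1)],
    "B-det3": [math.sqrt(2 / 5) * v for v in (-0.2, -0.1, 0.1, 0.2)],
}


def evaluate(xs):
    """Propagate an ensemble through model B; return (mean, std) of TS."""
    ts = [rope_b(x) for x in xs]
    return statistics.mean(ts), statistics.pstdev(ts)


for name, xs in ensembles.items():
    mean, std = evaluate(xs)
    print(f"{name}: mean TS = {mean:.1f}, std TS = {std:.1f}")
```

The means agree because the mean of R only depends on the variance of X, which all three ensembles share. The standard deviations differ because the std of R also depends on the fourth moment of X, which the known information leaves open.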
It is worth emphasizing that the only way to eliminate ambiguity is to provide complete statistical information. Clearly, what that means depends on the model structure — the same information was complete for model A but not model B. This discussion illustrates the typical reasoning of deterministic sampling. Only the imagination limits its utilization, which provides a challenge for us to express how a deterministic sampling approach might be beneficial for your specific application.