EPJ Nuclear Sci. Technol., Volume 2, 2016
Article Number 36, Number of page(s) 10
DOI: https://doi.org/10.1051/epjn/2016026
Published online: 16 September 2016
Regular Article
The impact of metrology study sample size on uncertainty in IAEA safeguards calculations
SGIM/Nuclear Fuel Cycle Information Analysis, International Atomic Energy Agency, Vienna International Centre, PO Box 100, 1400 Vienna, Austria
^{⁎} email: t.burr@iaea.org
Received: 4 January 2016; Accepted: 23 June 2016
Quantitative conclusions by the International Atomic Energy Agency (IAEA) regarding States' nuclear material inventories and flows are provided in the form of material balance evaluations (MBEs). MBEs use facility estimates of the material unaccounted for together with verification data to monitor for possible nuclear material diversion. Verification data consist of paired measurements (usually operators' declarations and inspectors' verification results) that are analysed one-item-at-a-time to detect significant differences. Also, to check for patterns, an overall difference of the operator-inspector values using a “D (difference) statistic” is used. The estimated detection probability (DP) and false alarm probability (FAP) depend on the assumed measurement error model and its random and systematic error variances, which are estimated using data from previous inspections (which are used for metrology studies to characterize measurement error variance components). Therefore, the sample sizes in both the previous and current inspections will impact the estimated DP and FAP, as is illustrated by simulated numerical examples. The examples include application of a new expression for the variance of the D statistic assuming the measurement error model is multiplicative and new application of both random and systematic error variances in one-item-at-a-time testing.
© T. Burr et al., published by EDP Sciences, 2016
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction, background, and implications
Nuclear material accounting (NMA) is a component of nuclear safeguards, which are designed to deter and detect illicit diversion of nuclear material (NM) from the peaceful fuel cycle for weapons purposes. NMA consists of periodically comparing measured NM inputs to measured NM outputs, and adjusting for measured changes in inventory. Avenhaus and Canty [1] describe quantitative diversion detection options for NMA data, which can be regarded as time series of residuals. For example, NMA at large throughput facilities closes the material balance (MB) approximately every 10 to 30 days around an entire material balance area, which typically consists of multiple process stages [2,3].
The MB is defined as MB = I_{begin} + T_{in} − T_{out} − I_{end}, where T_{in} is transfers in, T_{out} is transfers out, I_{begin} is beginning inventory, and I_{end} is ending inventory. The measurement error standard deviation of the MB is denoted σ_{MB}. Because many measurements enter the MB calculation, the central limit theorem and facility experience imply that MB sequences should be approximately Gaussian.
To monitor for possible data falsification by the operator that could mask diversion, paired (operator, inspector) verification measurements are assessed by using one-item-at-a-time testing to detect significant differences, and also by using an overall difference of the operator-inspector values (the “D (difference) statistic”) to detect overall trends. These paired data are declarations usually based on measurements by the operator, often using destructive assay (DA), and measurements by the inspector, often using nondestructive assay (NDA). The D statistic is commonly defined as D = N ∑_{j=1}^{n} (O_{j} − I_{j})/n, applied to paired (O_{j},I_{j}) where j indexes the sample items, O_{j} is the operator declaration, I_{j} is the inspector measurement, n is the verification sample size, and N is the total number of items in the stratum. Both the D statistic and the one-item-at-a-time tests rely on estimates of operator and inspector measurement uncertainties that are based on empirical uncertainty quantification (UQ). The empirical UQ uses paired (O_{j},I_{j}) data from previous inspection periods in metrology studies to characterize measurement error variance components, as we explain below. Our focus is a sensitivity analysis of the impact of the uncertainty in the measurement error variance components (that are estimated using the prior verification (O_{j},I_{j}) data) on sample size calculations in IAEA verifications. Such an assessment depends on the assumed measurement error model and associated uncertainty components, so it is important to perform effective UQ.
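As a minimal illustration of the D statistic described above, the following Python sketch assumes the common form D = N ∑(O_{j} − I_{j})/n. The function name `d_statistic` and the example values are hypothetical and are not taken from this article.

```python
def d_statistic(operator, inspector, population_size):
    """Return D = N * mean(o_j - i_j) over the n verified items."""
    if len(operator) != len(inspector) or not operator:
        raise ValueError("need equally long, non-empty paired samples")
    n = len(operator)
    total_diff = sum(o - i for o, i in zip(operator, inspector))
    # Scale the average paired difference up to the whole stratum of N items.
    return population_size * total_diff / n


# Hypothetical declarations and verification results (arbitrary units).
declared = [10.0, 12.0, 11.0, 9.0]
measured = [9.8, 12.1, 10.7, 9.0]
print(d_statistic(declared, measured, population_size=200))
```

The scaling by N turns the average paired difference into an estimate of the total operator-inspector difference over the stratum.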
This paper is organized as follows. Section 2 describes measurement error models and error variance estimation using Grubbs' estimation [4–6]. Section 3 describes statistical tests based on the D statistic and one-verification-item-at-a-time testing. Section 4 gives simulation results that describe inference quality as a function of two sample sizes. The first sample size n_{1} is the metrology study sample size (from previous inspection periods) used to estimate measurement error variances using Grubbs' (or similar) estimation methods. The second sample size n_{2} is the number of verification items from a population of size N. Section 5 is a discussion, summary, and implications.
2 Measurement error models
The measurement error model must account for variation within and between groups, where a group is, for example, a calibration or inspection period. The measurement error model used for safeguards sets the stage for applying an analysis of variance (ANOVA) with random effects [4,6–9]. If the errors tend to scale with the true value, then a typical model for multiplicative errors is I_{ij} = μ_{ij}(1 + S_{Ii} + R_{Iij}), (1) where I_{ij} is the inspector's measured value of item j (from 1 to n) in group i (from 1 to g), μ_{ij} is the true but unknown value of item j from group i, σ_{μ}^{2} is the “item variance”, defined here as the variance of the true values μ_{ij} over the items, R_{Iij} ~ N(0, δ_{RI}^{2}) is a random error of item j from group i, and S_{Ii} ~ N(0, δ_{SI}^{2}) is a short-term systematic error in group i. Note that the variance of I_{ij} is given by V(I_{ij}) = σ_{μ}^{2}(1 + δ_{SI}^{2} + δ_{RI}^{2}) + μ̄^{2}(δ_{SI}^{2} + δ_{RI}^{2}), where μ̄ is the mean of the μ_{ij}. The term σ_{μ}^{2} is called “product variability” by Grubbs [6]. Neither R_{Iij} nor S_{Ii} is observable from data. However, for various types of observed data, we can estimate the variances δ_{RI}^{2} and δ_{SI}^{2}. The same error model is typically also used for the operator, but with R_{Oij} ~ N(0, δ_{RO}^{2}) and S_{Oi} ~ N(0, δ_{SO}^{2}). We use capital letters such as I and O to denote random variables and corresponding lower case letters i and o to denote the corresponding observed values.
Figure 1 plots simulated example verification measurement data. The relative difference d˜ = (o − i)/o is plotted for each of 10 paired (o,i) measurements in each of 5 groups (inspection periods), for a total of 50 relative differences. As shown in Figure 1, typically, the between-group variation is noticeable compared to the within-group variation, although the between-group variation is amplified to a quite large value for better illustration in Figure 1; we used δ_{RO} = 0.005, δ_{SO} = 0.001, δ_{RI} = 0.01, δ_{SI} = 0.03, and the value δ_{SI} = 0.03 is quite large. Figure 2a is the same type of plot as Figure 1, but is for real data (four operator and inspector measurements on drums of UO_{2} powder from each of three inspection periods). Figure 2b plots inspector versus operator data for each of the three inspection periods; a linear fit is also plotted.
Fig. 2 Example real verification measurement data. (a) Four paired (O,I) measurements in three inspection periods; (b) inspector vs. operator measurement by group, with linear fits in each group. 
Fig. 1 Example simulated verification measurement data. The relative difference d˜ = (o − i)/o is plotted for each of 10 paired (o,i) measurements in each of 5 groups, for a total of 50 relative differences. The mean relative difference within each group (inspection period) is indicated by a horizontal line through the respective group means of the paired differences. 
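Data of the kind shown in Figure 1 can be generated with a few lines of code. The sketch below is only an illustration: it assumes the multiplicative form I = μ(1 + S + R) for both operator and inspector, normal errors, and a constant true value μ = 1 (that is, no item-to-item variation), none of which is prescribed by the figure itself. The function name is hypothetical; the four relative standard deviations are the values quoted for Figure 1.

```python
import random


def simulate_relative_differences(g, n, d_ro, d_so, d_ri, d_si, mu=1.0, seed=0):
    """Return g lists, each holding n relative differences (o - i) / o."""
    rng = random.Random(seed)
    groups = []
    for _ in range(g):
        # One short-term systematic error per group for each party.
        s_o = rng.gauss(0.0, d_so)
        s_i = rng.gauss(0.0, d_si)
        diffs = []
        for _ in range(n):
            # A fresh random error for every item measurement.
            o = mu * (1.0 + s_o + rng.gauss(0.0, d_ro))
            i = mu * (1.0 + s_i + rng.gauss(0.0, d_ri))
            diffs.append((o - i) / o)
        groups.append(diffs)
    return groups


groups = simulate_relative_differences(5, 10, 0.005, 0.001, 0.01, 0.03)
for k, diffs in enumerate(groups, start=1):
    print(f"group {k}: mean relative difference = {sum(diffs) / len(diffs):+.4f}")
```

Because the inspector's systematic error is drawn once per group, the group means scatter more than the points within a group, which is the pattern the figure is meant to convey.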
2.1 Grubbs' estimator for paired (operator, inspector) data
Grubbs introduced a variance estimator for paired data under the assumption that the measurement error model was additive. We have developed new versions of the Grubbs' estimator to accommodate multiplicative error models and/or prior information regarding the relative sizes of the true variances [4,5]. Grubbs' estimator was developed for the situation in which more than one measurement method is applied to multiple test items, but there is no replication of measurements by any of the methods. This is the typical situation in paired (O,I) data.
Grubbs' estimator for an additive error model can be extended to apply to the multiplicative model equation (1) as follows. First, equation (1) for the inspector data (the operator data is analysed in the same way) implies that the within-group mean squared error (MSE), ∑_{i=1}^{g} ∑_{j=1}^{n} (I_{ij} − Ī_{i})^{2}/(g(n − 1)), has expectation σ_{μ}^{2}(1 + δ_{SI}^{2} + δ_{RI}^{2}) + μ̄^{2}δ_{RI}^{2}, where μ̄ is the average value of μ_{ij} (assuming that each group has the same number of paired observations n). Second, the between-group MSE, n ∑_{i=1}^{g} (Ī_{i} − Ī)^{2}/(g − 1), has expectation σ_{μ}^{2}(1 + δ_{SI}^{2} + δ_{RI}^{2}) + μ̄^{2}δ_{RI}^{2} + nμ̄^{2}δ_{SI}^{2}. Therefore, both δ_{RI}^{2} and δ_{SI}^{2} are involved in both the within- and between-group MSEs, which implies that one must solve a system of two equations and two unknowns to estimate δ_{RI}^{2} and δ_{SI}^{2} [4,5]. By contrast, if the error model is additive, only σ_{RI}^{2} is involved in the within-group MSE, while both σ_{RI}^{2} and σ_{SI}^{2} are involved in the between-group MSE. The term σ_{μ}^{2} in both equations is estimated as in the additive error model, by using the fact that the covariance between operator and inspector measurements equals σ_{μ}^{2} [4,5]. However, σ_{μ}^{2} will be estimated with nonnegligible estimation error in many cases. For example, see Figure 2b where the fitted lines in periods 1 and 3 have negative slope, which implies that the estimate of σ_{μ}^{2} is negative in periods 1 and 3 (but the true value of σ_{μ}^{2} cannot be negative in this situation). We note that in the limit as σ_{μ}^{2} approaches zero, the expression for the within-group MSE reduces to that in the additive model case (and similarly for the between-group MSE).
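For orientation, the classical additive-model Grubbs estimator for a single group of paired, unreplicated measurements can be sketched as follows. This is the textbook two-instrument version with random errors only, not the multiplicative extension of [4,5]; the function name and the data are hypothetical.

```python
def grubbs_additive(operator, inspector):
    """Classical Grubbs estimates for paired, unreplicated measurements."""
    n = len(operator)
    if n != len(inspector) or n < 2:
        raise ValueError("need at least two pairs")
    mean_o = sum(operator) / n
    mean_i = sum(inspector) / n
    s_oo = sum((o - mean_o) ** 2 for o in operator) / (n - 1)
    s_ii = sum((i - mean_i) ** 2 for i in inspector) / (n - 1)
    s_oi = sum((o - mean_o) * (i - mean_i)
               for o, i in zip(operator, inspector)) / (n - 1)
    return {
        # The sample covariance estimates the item variance (product variability).
        "item_variance": s_oi,
        # Each sample variance minus the covariance estimates that method's error variance.
        "operator_error_variance": s_oo - s_oi,
        "inspector_error_variance": s_ii - s_oi,
    }


# Hypothetical paired values (arbitrary units).
estimates = grubbs_additive([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
for name, value in estimates.items():
    print(f"{name}: {value:+.4f}")
```

With so few pairs, one of the error-variance estimates in this example comes out negative, which mirrors the remark above that variance estimates from small samples can fall outside the admissible range.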
3 Applying uncertainty estimates: the D statistic and one-at-a-time verification measurements
This paper considers two possible IAEA verification tests. First, the overall D test for a pattern is based on the average difference, D = N ∑_{j=1}^{n} (o_{j} − i_{j})/n. Second, the one-at-a-time test compares the operator measurement to the corresponding inspector measurement for each item by computing a relative difference, defined as d_{j} = (o_{j} − i_{j})/o_{j}. If d_{j} > 3δ, where δ = (δ_{R}^{2} + δ_{S}^{2})^{1/2} with δ_{R}^{2} = δ_{RO}^{2} + δ_{RI}^{2} and δ_{S}^{2} = δ_{SO}^{2} + δ_{SI}^{2} (or some other alarm threshold close to the value of 3 that corresponds to a small false alarm probability), then the jth item selected for verification leads to an alarm. Note that the correct normalization used to define the relative difference is actually d_{j} = (o_{j} − i_{j})/μ_{j}, which has standard deviation exactly δ. But μ_{j} is not known in practice, so a reasonable approximation is to use d_{j} = (o_{j} − i_{j})/o_{j}, because the operator measurement o_{j} is typically more accurate and precise than the inspector's NDA measurement i_{j}. Provided the relative standard deviation is at most approximately 0.20, one can assume that d_{j} = (o_{j} − i_{j})/o_{j} is an adequate approximation to d_{j} = (o_{j} − i_{j})/μ_{j} [10]. Although IAEA experience suggests that the relative standard deviation sometimes exceeds 0.20, it is usually smaller [8].
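The one-at-a-time rule can be written compactly in code. The sketch below assumes that the total relative standard deviation δ is the root sum of squares of the four operator and inspector components and that an item alarms when (o − i)/o exceeds kδ with k = 3. The function name and the data are hypothetical.

```python
import math


def one_at_a_time_alarms(operator, inspector, d_ro, d_so, d_ri, d_si, k=3.0):
    """Return (indices of alarming items, alarm threshold k * delta)."""
    # Total relative standard deviation of one operator-inspector difference.
    delta = math.sqrt(d_ro**2 + d_so**2 + d_ri**2 + d_si**2)
    threshold = k * delta
    flagged = [j for j, (o, i) in enumerate(zip(operator, inspector))
               if (o - i) / o > threshold]
    return flagged, threshold


# Hypothetical declarations and inspector results; the second item is short by 20%.
flagged, threshold = one_at_a_time_alarms(
    [100.0, 100.0, 100.0], [99.0, 80.0, 101.0],
    d_ro=0.01, d_so=0.001, d_ri=0.05, d_si=0.005)
print(f"threshold = {threshold:.4f}, alarming items = {flagged}")
```

The test is one-sided because only an operator declaration that exceeds the inspector measurement points to possible missing material.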
3.1 The D statistic to test for a trend in the individual differences d_{j} = o_{j} − i_{j}
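For an additive error model, I_{ij} = μ_{ij} + S_{Ii} + R_{Iij}, it is known [11] that the variance of the D statistic is given by σ_{D}^{2} = N^{2}(σ_{R}^{2}/n + σ_{S}^{2}), where σ_{R}^{2} = σ_{RO}^{2} + σ_{RI}^{2} and σ_{S}^{2} = σ_{SO}^{2} + σ_{SI}^{2} are the absolute (not relative) variances. If one were sampling from a finite population without measurement error to estimate a population mean, then σ_{D}^{2} = N^{2}fσ^{2}/n, where f = (N − n)/N is the finite population correction factor, and σ^{2} is a quasivariance term (the “item variance” as defined previously in a slightly different context), defined here as σ^{2} = ∑_{i=1}^{N} (x_{i} − x̄)^{2}/(N − 1), where the x_{i} are the item values. Notice that without any measurement error, if n = N then f = 0, so σ_{D}^{2} = 0, which is quite different from σ_{D}^{2} = N^{2}(σ_{R}^{2}/N + σ_{S}^{2}). Figure 1 can be used to explain why σ_{D}^{2} = N^{2}(σ_{R}^{2}/n + σ_{S}^{2}) when there are both random and systematic measurement errors: the systematic error is shared by every item in a group, so it does not average out as n grows. The fact that σ_{D}^{2} = 0 when n = N and there are no measurement errors is also easily explainable.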
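A quick Monte Carlo check can make the additive-model variance concrete. The sketch below assumes the form σ_{D}^{2} = N^{2}(σ_{R}^{2}/n + σ_{S}^{2}), with σ_{R} and σ_{S} the combined (operator plus inspector) random and systematic standard deviations of one paired difference and a single inspection period. All numerical values and function names are illustrative choices, not values from this article.

```python
import math
import random


def sigma_d_additive(N, n, sigma_r, sigma_s):
    """Closed-form standard deviation of D under the assumed additive model."""
    return N * math.sqrt(sigma_r**2 / n + sigma_s**2)


def simulate_sigma_d(N, n, sigma_r, sigma_s, runs=20000, seed=1):
    """Empirical standard deviation of D over repeated simulated inspections."""
    rng = random.Random(seed)
    values = []
    for _ in range(runs):
        s = rng.gauss(0.0, sigma_s)  # one systematic error shared by all n differences
        mean_diff = s + sum(rng.gauss(0.0, sigma_r) for _ in range(n)) / n
        values.append(N * mean_diff)
    mean = sum(values) / runs
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (runs - 1))


formula_sd = sigma_d_additive(N=200, n=20, sigma_r=1.0, sigma_s=0.2)
mc_sd = simulate_sigma_d(N=200, n=20, sigma_r=1.0, sigma_s=0.2)
print(f"formula: {formula_sd:.2f}, simulation: {mc_sd:.2f}")
```

Increasing n in this sketch shrinks only the random-error contribution; the systematic term N^{2}σ_{S}^{2} remains as a floor.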
For a multiplicative error model (our focus), it can be shown [11] that σ_{D}^{2} is given by a three-term expression, equation (2), where δ_{R}^{2} = δ_{RO}^{2} + δ_{RI}^{2} and δ_{S}^{2} = δ_{SO}^{2} + δ_{SI}^{2}, and so to calculate σ_{D}^{2} in equation (2), one needs to know or assume values for σ_{μ}^{2} (the item variance) and the average of the true values, μ̄. In equation (2), the first two terms are analogous to N^{2}(σ_{R}^{2}/n + σ_{S}^{2}) in the additive error model case. The third term involves σ_{μ}^{2} and decreases to 0 when n = N. Again, in the limit as σ_{μ}^{2} approaches zero, equation (2) reduces to that for the additive model case; and regardless of whether σ_{μ}^{2} is large or near zero, the effect of δ_{S}^{2} cannot be reduced by taking more measurements (increasing n in Eq. (2)).
In general, the multiplicative error model gives different results than an additive error model because variation in the true values, σ_{μ}^{2}, contributes to σ_{D}^{2} in a multiplicative model, but not in an additive model. For example, let the absolute variances σ_{R}^{2} and σ_{S}^{2} of the additive model be chosen so that the average variance in the multiplicative model is the same as the variance in the additive model for both random and systematic errors. Assume δ_{R} = 0.10, δ_{S} = 0.02, a mean true value μ̄ (arbitrary units), and σ_{μ} equal to 50% of μ̄ (50% relative standard deviation in the true values). Then the additive model has σ_{D} = 270.8 and the corresponding multiplicative model with the same average absolute variance has σ_{D} = 310.2, a 15% increase. The fact that var(μ) contributes to σ_{D}^{2} in a multiplicative model has an implication for sample size calculations such as those we describe in Section 4. Provided the magnitude of S_{Ii} + R_{Iij} is approximately 0.2 or less (equivalently, the relative standard deviation of S_{Ii} + R_{Iij} should be approximately 8% or less), one can convert equation (1) to an additive model by taking logarithms, using the approximation log(1 + x) ≈ x for |x| ≤ 0.20. However, there are many situations for which the log transform will not be sufficiently accurate, so this paper describes a recently developed option to accommodate multiplicative models rather than using approximations based on the logarithm transform [4,5].
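The quality of the approximation log(1 + x) ≈ x, which underlies the log-transform route to an additive model, is easy to tabulate. The short sketch below is only a numerical illustration; the function name and the grid of x values are arbitrary choices.

```python
import math


def log_approx_relative_error(x):
    """Relative error of approximating log(1 + x) by x."""
    exact = math.log1p(x)
    return abs(x - exact) / abs(exact)


for x in (0.01, 0.05, 0.10, 0.20, 0.50):
    print(f"x = {x:4.2f}: relative error = {log_approx_relative_error(x):.1%}")
```

The error stays below about 10% up to x = 0.20 and grows quickly beyond that, which is consistent with restricting the log transform to relative errors of modest size.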
The overall D test for a pattern is based on the average difference, D = N ∑_{j=1}^{n} (o_{j} − i_{j})/n. The D-statistic test is based on equation (2), where δ_{R}^{2} is the random error variance and δ_{S}^{2} is the systematic error variance of d˜ = (o − i)/μ ≈ (o − i)/o, and σ_{μ}^{2} is the absolute variance of the true (unknown) values. If the observed D value exceeds 3σ_{D} (or some similar multiple of σ_{D} to achieve a low false alarm probability) then the D test alarms.
The test that alarms if D ≥ 3σ_{D} is actually testing whether D ≥ 3σˆ_{D}, where σˆ_{D} denotes an estimate of σ_{D}; this leads to two sample size evaluations. The first sample size n_{1} involves metrology data collected in previous inspection samples used to estimate δ_{R}^{2}, δ_{S}^{2}, and σ_{μ}^{2} needed in equation (2). The second sample size n_{2} is the number of operator's declared measurements randomly selected for verification by the inspector. The sample size n_{1} consists of two sample sizes: the number of groups g (inspection periods) used to estimate δ_{S}^{2} and the total number of items over all groups, n_{1} = gn in the case (the only case we consider in examples in Sect. 4) that each group has n paired measurements.
3.2 One-at-a-time sample verification tests
The IAEA has historically used zero-defect sampling, which means that the only acceptable (passing) sample is one for which no defects are found. Therefore, the nondetection probability is the probability that no defects are found in a sample of size n when one or more true defective items are in the population of size N. For one-item-at-a-time testing, the nondetection probability is given by 1 − DP = ∑_{i=0}^{min(n,r)} A_{i}B_{i}, (3) where the term A_{i} is the probability that the selected sample contains i truly defective items, which is given by the hypergeometric distribution with parameters i, n, N, r, where i is the number of defects in the sample, n is the sample size, N is the population size, and r is the number of defective items in the population. More specifically, A_{i} = C(r, i) C(N − r, n − i)/C(N, n), where C(a, b) denotes the binomial coefficient.
The above equation is the probability of choosing i defective items from r defective items in a population of size N in a sample of size n, which is the well-known hypergeometric distribution. The term B_{i} is the probability that none of the i truly defective items is inferred to be defective based on the individual d tests. The value of B_{i} depends on the metrology and the alarm threshold. Assuming a multiplicative error model for the inspector measurement (and similarly for the operator) implies that, for an alarm threshold of k = 3, we have to calculate B_{i} = P(d_{1} ≤ 3δ, d_{2} ≤ 3δ, …, d_{i} ≤ 3δ), where δ = (δ_{R}^{2} + δ_{S}^{2})^{1/2}, which is given by the multivariate normal integral of the i-variate normal density with mean vector λ and covariance matrix ∑_{i} over the region in which every component is at most 3δ,
where each of the components of λ is equal to 1 SQ/r (SQ is a significant quantity; for example, 1 SQ = 8 kg for Pu, and r was defined above as the number of defective items in the population). The term ∑_{i} in the B_{i} calculation involved in the multivariate normal integral is a square matrix with i rows and columns with values δ_{R}^{2} + δ_{S}^{2} on the diagonal and values δ_{S}^{2} on the off-diagonals.
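The structure of the nondetection probability, a hypergeometric weight A_{i} times a miss probability B_{i}, can be sketched numerically. In the illustration below, B_{i} is estimated by simple Monte Carlo rather than by evaluating the multivariate normal integral, the falsification is expressed as a relative shift per defective item, and δ_{R}, δ_{S} denote the combined operator-inspector relative standard deviations. These choices, the function names, and all numerical values are assumptions made for the example.

```python
import math
import random


def hypergeom_pmf(i, n, N, r):
    """P(a sample of n from N items contains i of the r defective items)."""
    if i < 0 or i > r or n - i < 0 or n - i > N - r:
        return 0.0
    return math.comb(r, i) * math.comb(N - r, n - i) / math.comb(N, n)


def prob_no_alarm(i, shift, delta_r, delta_s, k=3.0, runs=5000, seed=2):
    """Monte Carlo estimate of B_i: none of i falsified items exceeds k * delta."""
    if i == 0:
        return 1.0
    rng = random.Random(seed)
    threshold = k * math.sqrt(delta_r**2 + delta_s**2)
    misses = 0
    for _ in range(runs):
        s = rng.gauss(0.0, delta_s)  # systematic error shared by the i items
        if all(shift + s + rng.gauss(0.0, delta_r) <= threshold for _ in range(i)):
            misses += 1
    return misses / runs


def nondetection_probability(n, N, r, shift, delta_r, delta_s):
    """Sum of A_i * B_i over the possible numbers i of sampled defectives."""
    return sum(hypergeom_pmf(i, n, N, r) * prob_no_alarm(i, shift, delta_r, delta_s)
               for i in range(min(n, r) + 1))


beta = nondetection_probability(n=30, N=200, r=10, shift=0.5,
                                delta_r=0.14, delta_s=0.07)
print(f"nondetection probability = {beta:.3f}, detection probability = {1 - beta:.3f}")
```

The shared systematic error makes the i individual tests positively correlated, which is why B_{i} is not simply the i-th power of a single-item miss probability.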
4 Simulation study
The left-hand sides of equations (2) and (3) can be considered a “measurand” in the language used in the Guide to the Expression of Uncertainty in Measurement (GUM) [12]. Although the error propagation in the GUM is typically applied in a “bottom-up” uncertainty evaluation of a measurement method, it can also be applied to any other output quantity y (such as y = σ_{D} or y = DP) expressed as a known function y = f(x_{1}, x_{2}, …, x_{p}) of inputs x_{1}, x_{2}, …, x_{p} (inputs such as δ_{R} and δ_{S}). The GUM recommends linear approximations (“delta method”) or Monte Carlo simulations to propagate uncertainties in the inputs to predict uncertainties in the output. Here we use Monte Carlo simulations to evaluate the uncertainties in the inputs δ_{R} and δ_{S} and also to evaluate the uncertainty in y = σ_{D} or y = DP as a function of the uncertainties in the inputs. Notice that equation (2) is linear in δ_{R}^{2} and δ_{S}^{2}, so the delta method to approximate the uncertainty in σ_{D}^{2} would be exact; however, there is nonzero covariance (a negative covariance) between the estimates δˆ_{R}^{2} and δˆ_{S}^{2} that would need to be taken into account in the delta method.
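The two propagation routes mentioned above can be compared on a small example. The sketch below uses the additive-model form σ_{D} = N(v_{R}/n + v_{S})^{1/2} as the output function, treats the estimated variances v_{R} and v_{S} as jointly normal with a negative covariance, and compares the delta-method standard deviation with a Monte Carlo one. The normality assumption, the function names, and every numerical value are hypothetical.

```python
import math
import random


def sigma_d(v_r, v_s, N, n):
    """Output quantity: sigma_D as a function of the two variance inputs."""
    return N * math.sqrt(v_r / n + v_s)


def delta_method_sd(v_r, v_s, var_vr, var_vs, cov, N, n):
    """First-order (delta method) standard deviation of sigma_D."""
    y = sigma_d(v_r, v_s, N, n)
    g_r = N**2 / (2.0 * y * n)  # partial derivative with respect to v_r
    g_s = N**2 / (2.0 * y)      # partial derivative with respect to v_s
    return math.sqrt(g_r**2 * var_vr + g_s**2 * var_vs + 2.0 * g_r * g_s * cov)


def monte_carlo_sd(v_r, v_s, var_vr, var_vs, cov, N, n, runs=50000, seed=3):
    """Standard deviation of sigma_D from correlated normal draws of the inputs."""
    rng = random.Random(seed)
    a = math.sqrt(var_vr)          # Cholesky factor of the 2 x 2 covariance matrix
    b = cov / a
    c = math.sqrt(var_vs - b**2)
    ys = []
    for _ in range(runs):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        ys.append(sigma_d(v_r + a * z1, v_s + b * z1 + c * z2, N, n))
    mean = sum(ys) / runs
    return math.sqrt(sum((y - mean) ** 2 for y in ys) / (runs - 1))


args = dict(v_r=1.0, v_s=0.04, var_vr=0.01, var_vs=0.0001, cov=-0.0003, N=200, n=20)
delta_sd = delta_method_sd(**args)
mc_sd = monte_carlo_sd(**args)
print(f"delta method: {delta_sd:.3f}, Monte Carlo: {mc_sd:.3f}")
```

Dropping the covariance term in `delta_method_sd` changes the answer noticeably in this example, which illustrates why the covariance between the two variance estimates cannot be ignored.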
We used the statistical programming language R [13] to perform simulations for example true values of the model parameters and the amount of diverted nuclear material. For each of 10^{5} or more simulation runs, normal errors were generated assuming the multiplicative error model (1) for both random and systematic errors (see Sect. 4.2 for examples with nonnormal errors). The new version of the Grubbs' estimator for multiplicative errors was applied to produce the estimates δˆ_{RO}, δˆ_{SO}, δˆ_{RI}, δˆ_{SI}, and σˆ_{μ}, which were then used to estimate y = σ_{D} in equation (2) and y = DP in equation (3). Because there is large uncertainty in these individual estimates unless σ_{μ} is nearly 0, we also present results for a modified Grubbs' estimator applied to the relative differences that estimates the aggregated variances δ_{R}^{2} and δ_{S}^{2}, and also estimates σ_{μ}. Results are described in Sections 4.1 and 4.2.
4.1 The D statistic to test for a trend in the individual differences d_{j} = (o_{j} − i_{j})/o_{j}
Figure 3 plots 95% confidence intervals (CIs) for σ_{D} versus sample size n_{2} using the modified Grubbs' estimator applied to the relative differences for the parameter values δ_{RO} = 0.01, δ_{SO} = 0.001, δ_{RI} = 0.05, δ_{SI} = 0.005, , σ_{μ} = 0.01, N = 200 for case A (defined here and throughout as n_{1} = 4 with g = 2, n = 2) and for case B (defined here and throughout as n_{1} = 50 with g = 5, n = 10). We used 10^{5} simulations of the measurement process to estimate the quantiles of the distribution of y = σ_{D}. We confirmed by repeating the sets of 10^{5} simulations that simulation error due to using a finite number of simulations is negligible. Clearly, and not surprisingly, the sample size in case A leads to CIs that seem too wide for effectively quantifying the uncertainty in σ_{D}. The traditional Grubbs' estimator performs poorly unless σ_{μ} is very small, such as σ_{μ} = 0.0001. We use the traditional Grubbs' estimator in Section 4.2. The modified estimator that estimates the aggregated variances performs well for any value of σ_{μ}.
Figure 4 is similar to Figure 3, except Figure 4 plots the length of 95% CIs for 6 possible values of n_{1} (see the figure legend). Again, the case A sample size is probably too small for effective estimation of σ_{D}. In this example, the smallest length CI is for g = 5 and n = 100, but n = 100 is unrealistically large, while g = 3 and n = 10 or g = 5 and n = 10 are typically possible with reasonable resources. The length of these 95% CIs is one criterion to choose an effective sample size n_{1}.
Another criterion to choose an effective sample size n_{1} is the root mean squared error (RMSE, defined below) in estimating the sample size n_{2} needed to achieve σ_{D} = 8/3.3 (the 3.3 is an example value that corresponds to a 95% DP to detect an 8 kg shift (1 SQ for Pu) while maintaining a 0.05 FAP when testing for material loss). In this example, the RMSE in estimating the sample size n_{2} needed to achieve σ_{D} = 8/3.3 is approximately 12.9 for case A and 8.0, 7.3, 6.8, 6.7, and 6.3, respectively, for the other values of n_{1} considered in Figure 4. These RMSEs are repeatable to within ±0.1 across sets of 10^{5} simulations so the RMSE values are in the same order as the CI lengths in Figure 4. The RMSE is defined as RMSE = [∑_{i=1}^{M} (nˆ_{2,i} − n_{2,true})^{2}/M]^{1/2}, where M is the number of simulations, nˆ_{2,i} is the estimated sample size n_{2} in simulation i that is needed in order to achieve σ_{D} = 8/3.3, and n_{2,true} is the true sample size n_{2} (n_{2,true} = 22 in this example; see Fig. 3 where the true value of σ_{D} versus n_{2} is also shown) needed to achieve σ_{D} = 8/3.3.
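The factor 3.3 and the RMSE criterion can both be reproduced in a few lines. The sketch below assumes a one-sided normal test, for which the shift detectable with a given DP and FAP equals (z_{1−FAP} + z_{DP})σ_{D}, and it computes the RMSE of a list of estimated sample sizes about a true value. The function names and the short list of estimates are hypothetical.

```python
import math
from statistics import NormalDist


def threshold_multiplier(detection_prob, false_alarm_prob):
    """Ratio shift / sigma_D required by a one-sided normal test."""
    z = NormalDist().inv_cdf
    return z(1.0 - false_alarm_prob) + z(detection_prob)


def rmse(estimates, true_value):
    """Root mean squared error of estimates about the true value."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))


multiplier = threshold_multiplier(0.95, 0.05)
print(f"multiplier = {multiplier:.2f}, target sigma_D = {8.0 / multiplier:.2f} kg")

# Hypothetical estimated sample sizes from three simulation runs, true value 22.
print(f"RMSE = {rmse([20, 24, 22], 22):.3f}")
```

In a full study the list passed to `rmse` would hold one estimated n_{2} per simulation run.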
Another criterion to choose an effective size n_{1} is the detection probability to detect specified loss scenarios. We consider this criterion in Section 4.3.
Fig. 3 The estimate of σ_{D} versus sample size n_{2} for two values of n_{1} (case A: g = 2, n = 2 so n_{1} = 4, or case B: g = 5, n = 10 so n_{1} = 50). 
Fig. 4 Estimated lengths of 95% confidence intervals for σ_{D} versus sample size n_{2} for six values of n_{1} (g = 2, n = 2 so n_{1} = 4, g = 3, n = 5 so n_{1} = 15, etc.). 
4.2 Uncertainty on the uncertainty on the uncertainty
The term “uncertainty” typically refers to a measurement error standard deviation, such as σ_{D}. Therefore, Figures 3 and 4 involve the “uncertainty of the uncertainty” as a function of n_{1} (defined as n_{1} = ng, so more correctly, as a function of g and n) and n_{2}. Figures 5–7 illustrate the “uncertainty of the uncertainty of the uncertainty” (we commit to stopping at this level-three usage of “uncertainty”). The “uncertainty of the uncertainty” depends on the underlying measurement error probability density, which is sometimes itself uncertain. Figure 5 plots the familiar normal density and three nonnormal densities (uniform, gamma, and generalized lambda, [14]). Figure 6 plots the estimated probability density (using the 10^{5} realizations) of the estimated value of δ_{RI} using the traditional Grubbs' estimator for each of the four distributions (the true value of δ_{RI} is 0.05) and the five true standard deviations are the same as in Section 4.1 for generating the random variables (δ_{RO} = 0.01, δ_{SO} = 0.001, δ_{RI} = 0.05, δ_{SI} = 0.005, , σ_{μ} = 0.01, N = 200). Figure 7 is similar to Figure 3 (for g = 5, n = 10), except it compares CIs assuming the normal distribution to CIs assuming the generalized lambda distribution. That is, Figure 7 plots the estimated CI, again for the model parameters as above, for σ_{D} for the normal and for the generalized lambda distributions. In this case, the CIs are wider for the generalized lambda distribution than for the normal distribution. Recall (Fig. 5) that the standard deviations of the four estimated probability densities are 0.14, 0.25, 0.10, and 0.36 for the normal, gamma, uniform, and generalized lambda, respectively. Therefore, one might expect the CI for σ_{D} to be shorter for the normal than for a generalized lambda distribution that has the same relative standard deviation as the corresponding normal distribution.
Fig. 7 95% confidence intervals for the estimate of σ_{D} versus sample size n_{2} for case B, assuming the measurement error distribution is either the normal or the generalized lambda distribution. 
Fig. 6 The estimated probability density for δˆ_{RI} in the four example measurement error probability densities (normal, gamma, uniform, and generalized lambda, each with mean 0 and variance 1) from Figure 5. 
Fig. 5 Four example measurement error probability densities: normal, gamma, uniform, and generalized lambda, each with mean 0 and variance 1. 
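Non-normal error distributions with mean 0 and variance 1, as in Figure 5, can be obtained by standardizing familiar distributions. The sketch below covers the uniform and gamma cases only (the generalized lambda case is omitted); the gamma shape parameter of 2 and the function names are illustrative assumptions rather than the settings used for the figure.

```python
import math
import random


def standardized_uniform(rng):
    """Uniform draw on (-sqrt(3), sqrt(3)), which has mean 0 and variance 1."""
    half_width = math.sqrt(3.0)
    return rng.uniform(-half_width, half_width)


def standardized_gamma(rng, shape=2.0):
    """Gamma(shape, scale=1) draw, shifted and scaled to mean 0 and variance 1."""
    x = rng.gammavariate(shape, 1.0)
    return (x - shape) / math.sqrt(shape)


def sample_mean_and_variance(draw, rng, size=50000):
    """Empirical mean and variance of repeated draws."""
    xs = [draw(rng) for _ in range(size)]
    mean = sum(xs) / size
    var = sum((x - mean) ** 2 for x in xs) / (size - 1)
    return mean, var


rng = random.Random(4)
uniform_stats = sample_mean_and_variance(standardized_uniform, rng)
gamma_stats = sample_mean_and_variance(standardized_gamma, rng)
print("uniform: mean %.3f, variance %.3f" % uniform_stats)
print("gamma:   mean %.3f, variance %.3f" % gamma_stats)
```

Multiplying such a standardized draw by a relative standard deviation such as δ_{RI} gives a random error with the intended spread but a non-normal shape.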
4.3 One-at-a-time testing
For one-at-a-time testing, Figure 8 plots 95% confidence intervals for the estimated DP versus sample size n_{2} for cases A and B (see Sect. 4.1). The true parameter values used in equation (3) were δ_{RO} = 0.1, δ_{SO} = 0.05, δ_{RI} = 0.1, δ_{SI} = 0.05, , σ_{μ} = 0.01. In addition, a true mean shift of 8 kg in each of 10 falsified items was used (representing data falsification by the operator to mask diversion of material). The CIs for the DP were estimated by using the observed 2.5% and 97.5% quantiles of the DP values in 10^{5} simulations. As in Section 4.1, we confirmed by repeating the sets of 10^{5} simulations that simulation error due to using a finite number of simulations is negligible. The very small case A sample leads to approximately the same lower 2.5% quantile as did case B; however, the upper 97.5% quantile is considerably lower for case A than for case B. Other values for the parameters (δ_{RO}, δ_{SO}, δ_{RI}, δ_{SI}, , σ_{μ}, the number of falsified items, and the amount falsified per item) lead to different conclusions about how the uncertainty in the estimated DP behaves as a function of n_{2}. For example, if we reduce to in this example, then the confidence interval lengths are very short for both case A and case B.
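The percentile intervals described above are straightforward to compute from a set of simulated values. The sketch below takes the 2.5% and 97.5% sample quantiles using Python's standard library; the function name is hypothetical, and the input list is a stand-in for simulated DP estimates rather than output of the study.

```python
import random
from statistics import quantiles


def percentile_interval_95(values):
    """Return the observed 2.5% and 97.5% quantiles of the values."""
    cuts = quantiles(values, n=40)  # 39 cut points at multiples of 2.5%
    return cuts[0], cuts[-1]


# Stand-in for estimated detection probabilities from repeated simulations.
rng = random.Random(5)
simulated_dp = [min(1.0, max(0.0, rng.gauss(0.6, 0.1))) for _ in range(10000)]
lower, upper = percentile_interval_95(simulated_dp)
print(f"95% interval for the estimated DP: ({lower:.3f}, {upper:.3f})")
```

The interval length `upper - lower` is the quantity that can then be compared across metrology sample sizes.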
For this same example, we can also compute the DP in using the D statistic to detect the loss (which the operator attempts to mask by falsifying the data). For the example just described (for which simulation results are shown in Fig. 8), the true DP in using the D statistic (using an alarm threshold of σ_{D} and n_{2} = 30 using Eq. (2)) is 0.65. The corresponding true DP for one-at-a-time testing is 0.27. Therefore, in this example, with 10 of 200 items falsified, each by an amount of 8 units, the D statistic has lower DP than the n_{2} = 30 one-at-a-time tests. In other examples, the D statistic will have higher DP, particularly when there are many falsified items in the population. For example, if we increase the number of defectives in this example from 10 of 200 to 20, 30, or 40 of 200, then the DPs are (0.17, 0.17), (0.08, 0.15), and (0.06, 0.14) for one-at-a-time testing and for the D statistic, respectively. These are low DPs, largely because the measurement error variances are large in this example. One can also assess the sensitivity of the estimated DP in using the D statistic to the uncertainty in the estimated variances; for brevity, we do not show that here.
Fig. 8 Estimated detection probability and 95% confidence interval versus sample size n_{2} for cases A and B. The true detection probability is plotted as the solid (black) line. 
5 Discussion and summary
This study was motivated by three considerations. First, there is an ongoing need to improve UQ for error variance estimation. For example, some applications involve characterizing items for long-term storage and the measurement error behaviour for the items is not well known, so an initial metrology study with to-be-determined sample sizes is required. Second, we recently provided the capability to allow for multiplicative error models in evaluating the D statistic (Eq. (2)) [4,5]. Third, we recently provided the capability to allow for both random and systematic errors in one-at-a-time item testing (Eq. (3)).
We presented a simulation study that assumed error variances are estimated using an initial metrology study characterized by g measurement groups and n paired (operator, inspector) measurements per group. Not surprisingly, for both one-item-at-a-time testing and pattern testing using the D statistic, it appears that g = 2 and n = 2 is too small for effective variance estimation.
Therefore, the sample sizes in the previous and current inspections will impact the estimated DP and FAP, as is illustrated by numerical examples. The numerical examples include application of the new expression for the variance of the D statistic assuming the measurement error model is multiplicative (Eq. (2)) and new application of both random and systematic error variances in one-item-at-a-time testing (Eq. (3)).
Future work will evaluate the impact of larger values of product variability, σ_{μ}^{2}, on the standard Grubbs' estimator; this study used a very small value of σ_{μ}, which is adequate in some contexts, such as product streams. The value of σ_{μ} could be considerably larger in some NM streams, particularly waste streams. Therefore, this study also evaluated the relative differences d_{j} = (o_{j} − i_{j})/o_{j} to estimate the aggregated quantities needed in equations (2) and (3), δ_{R}^{2} and δ_{S}^{2}, using a modified Grubbs' estimation, to mitigate the impact of noise in estimation of σ_{μ}. Because σ_{μ}^{2} is a source of noise in estimating the individual measurement error variances [15], a Bayesian alternative is under investigation to reduce its impact [16]. Also, one could base a statistical test for data falsification on the relative differences between operator and inspector measurements d = (o − i)/o, in which case an alternate expression to equation (2) for σ_{D} that does not involve the product variability would be used.
5.1 Implications and influences
This study was motivated by three considerations, each of which has implications for future work. First, there is an ongoing need to improve UQ for error variance estimation. For example, some applications involve characterizing items for long-term storage and the measurement error behaviour might not be well known for the items, so an initial metrology study with to-be-determined sample sizes is required. Second, we recently provided the capability to allow for multiplicative error models in evaluating the D statistic (Eq. (2) in Sect. 3) [4,5]. Third, we recently provided the capability to allow for both random and systematic errors in one-at-a-time item testing (Eq. (3) in Sect. 3). Previous to this work, the variance of the D statistic was estimated by assuming measurement error models are additive rather than multiplicative, and one-at-a-time item testing assumed that all measurement errors were purely random.
Acknowledgments
The authors acknowledge CETAMA for hosting the November 17–19, 2015 conference on sampling and characterizing where this paper was first presented.
References
1. R. Avenhaus, M. Canty, Compliance Quantified (Cambridge University Press, 1996)
2. T. Burr, M.S. Hamada, Revisiting statistical aspects of nuclear material accounting, Sci. Technol. Nucl. Install. 2013, 961360 (2013)
3. T. Burr, M.S. Hamada, Bayesian updating of material balances covariance matrices using training data, Int. J. Prognost. Health Monitor. 5, 6 (2014)
4. E. Bonner, T. Burr, T. Guzzardo, T. Krieger, C. Norman, K. Zhao, D.H. Beddingfield, W. Geist, M. Laughter, T. Lee, Ensuring the effectiveness of safeguards through comprehensive uncertainty quantification, J. Nucl. Mater. Manage. 44, 53 (2016)
5. T. Burr, T. Krieger, K. Zhao, Grubbs' estimators in multiplicative error models, IAEA report, 2015
6. F. Grubbs, On estimating precision of measuring instruments and product variability, J. Am. Stat. Assoc. 43, 243 (1948)
7. K. Martin, A. Böckenhoff, Analysis of short-term systematic measurement error variance for the difference of paired data without repetition of measurement, Adv. Stat. Anal. 91, 291 (2007)
8. R. Miller, Beyond ANOVA: Basics of Applied Statistics (Chapman & Hall, 1998)
9. C. Norman, Measurement errors and their propagation, Internal IAEA Document, 2014
10. G. Marsaglia, Ratios of normal variables, J. Stat. Softw. 16, 2 (2006)
11. T. Burr, T. Krieger, K. Zhao, Variations of the D statistics for additive and multiplicative error models, IAEA report, 2015
12. Guide to the Expression of Uncertainty in Measurement, JCGM 100: www.bipm.org (2008)
13. R Core Team, R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, Austria, 2012): www.R-project.org
14. M. Freimer, G. Mudholkar, G. Kollia, C. Lin, A study of the generalized Tukey Lambda family, Commun. Stat. Theor. Methods 17, 3547 (1988)
15. F. Lombard, C. Potgieter, Another look at Grubbs' estimators, Chemom. Intell. Lab. Syst. 110, 74 (2012)
16. C. Elster, Bayesian uncertainty analysis compared to the application of the GUM and its supplements, Metrologia 51, S159 (2014)
Cite this article as: Tom Burr, Thomas Krieger, Claude Norman, Ke Zhao, The impact of metrology study sample size on uncertainty in IAEA safeguards calculations, EPJ Nuclear Sci. Technol. 2, 36 (2016)