The impact of metrology study sample size on uncertainty in IAEA safeguards calculations

Quantitative conclusions by the International Atomic Energy Agency (IAEA) regarding States' nuclear material inventories and flows are provided in the form of material balance evaluations (MBEs). MBEs use facility estimates of the material unaccounted for together with verification data to monitor for possible nuclear material diversion. Verification data consist of paired measurements (usually operators' declarations and inspectors' verification results) that are analysed one-item-at-a-time to detect significant differences. Also, to check for patterns, an overall difference of the operator-inspector values using a “D (difference) statistic” is used. The estimated DP and false alarm probability (FAP) depend on the assumed measurement error model and its random and systematic error variances, which are estimated using data from previous inspections (which are used for metrology studies to characterize measurement error variance components). Therefore, the sample sizes in both the previous and current inspections will impact the estimated DP and FAP, as is illustrated by simulated numerical examples. The examples include application of a new expression for the variance of the D statistic assuming the measurement error model is multiplicative and new application of both random and systematic error variances in one-item-at-a-time testing.


Introduction, background, and implications
Nuclear material accounting (NMA) is a component of nuclear safeguards, which are designed to deter and detect illicit diversion of nuclear material (NM) from the peaceful fuel cycle for weapons purposes.NMA consists of periodically comparing measured NM inputs to measured NM outputs, and adjusting for measured changes in inventory.Avenhaus and Canty [1] describe quantitative diversion detection options for NMA data, which can be regarded as time series of residuals.For example, NMA at large throughput facilities closes the material balance (MB) approximately every 10 to 30 days around an entire material balance area, which typically consists of multiple process stages [2,3].
The MB is defined as MB = I begin þ T in À T out À I end , where T in is transfers in, T out is transfers out, I begin is beginning inventory, and I end is ending inventory.The measurement error standard deviation of the MB is denoted s MB .Because many measurements enter the MB calculation, the central limit theorem, and facility experience imply that MB sequences should be approximately Gaussian.
To monitor for possible data falsification by the operator that could mask diversion, paired (operator, inspector) verification measurements are assessed by using one-item-at-a-time testing to detect significant differences, and also by using an overall difference of the operatorinspector values (the "D (difference) statistic") to detect overall trends.These paired data are declarations usually based on measurements by the operator, often using DA, and measurements by the inspector, often using NDA.The D statistic is commonly defined as D ¼ N P n j¼1 ðO j À I j Þ=n, applied to paired (O j ,I j ) where j indexes the sample items, O j is the operator declaration, I j is the inspector measurement, n is the verification sample size, and N is the total number of items in the stratum.Both the D statistic and the one-item-at-a-time tests rely on estimates of operator and inspector measurement uncertainties that are based on empirical uncertainty quantification (UQ).The empirical UQ uses paired (O j ,I j ) data from previous inspection periods in metrology studies to characterize measurement error variance components, as we explain below.Our focus is a sensitivity analysis of the impact of the uncertainty in the measurement error variance components (that are estimated using the prior verification (O j ,I j ) data) on sample size calculations in IAEA verifications.Such an assessment depends on the assumed measurement error model and associated uncertainty components, so it is important to perform effective UQ.
This paper is organized as follows.Section 2 describes measurement error models and error variance estimation using Grubbs' estimation [4][5][6].Section 3 describes statistical tests based on the D statistic and one-verification-item-at-a-time testing.Section 4 gives simulation results that describe inference quality as a function of two sample sizes.The first sample size n 1 is the metrology study sample size (from previous inspection periods) used to estimate measurement error variances using Grubbs' (or similar) estimation methods.The second sample size n 2 is the number of verification items from a population of size N. Section 5 is a discussion, summary, and implications.

Measurement error models
The measurement error model must account for variation within and between groups, where a group is, for example, a calibration or inspection period.The measurement error model used for safeguards sets the stage for applying an analysis of variance (ANOVA) with random effects [4,[6][7][8][9].If the errors tend to scale with the true value, then a typical model for multiplicative errors is where I ij is the inspector's measured value of item j (from 1 to n) in group i (from 1 to g), m ij is the true but unknown value of item j from group i, s 2 m is the "item variance", defined here as s 2 m ¼ Þ is a random error of item j from group i, and The term s 2 m is the called "product variability" by Grubbs [6].Neither R Iij nor S Ii are observable from data.However, for various types of observed data, we can estimate the variances d 2 RI and d 2 SI .The same error model is typically also used for the operator, but with We use capital letters such as I and O to denote random variables and corresponding lower case letters i and o to denote the corresponding observed values.Figure 1 plots simulated example verification measurement data.The relative difference d~= (o À i)/o is plotted for each of 10 paired (o,i) measurements in each of 5 groups (inspection periods), for a total of 50 relative differences.As shown in Figure 1, typically, the between-group variation is noticeable compared to the within-group variation, although the between-group variation is amplified to a quite large value for better illustration in Figure 1; we used d RO = 0.005, d SO = 0.001, d RI = 0.01, d SI = 0.03, and the value d SI = 0.03 is quite large.Figure 2a is the same type of plot as Figure 1, but is for real data (four operator and inspector measurements on drums of UO 2 powder from each of three inspection periods).Figure 2b plots inspector versus operator data for each of the three inspection periods; a linear fit is also plotted.

Grubbs' estimator for paired (operator, inspector) data
Grubbs introduced a variance estimator for paired data under the assumption that the measurement error model was additive.We have developed new versions of the Grubbs' estimator to accommodate multiplicative error models and/or prior information regarding the relative sizes of the true variances [4,5].Grubbs' estimator was developed for the situation in which more than one measurement method is applied to multiple test items, but there is no replication of measurements by any of the methods.This is the typical situation in paired (O,I) data.
Grubbs' estimator for an additive error model can be extended to apply to the multiplicative model equation (1) as follows.First, equation (1) for the inspector data (the operator data is analysed in the same way) implies that the within-group mean squared error (MSE), where m is the average value of m ij (assuming that each group has the same number of paired observations n).Second, the between-group MSE, ð m in both equations is estimated as in the additive error model, by using the fact that the covariance between operator and inspector measurements equals s 2 m [4,5].However, s 2 m will be estimated with non-negligible estimation error in many cases.For example, see Figure 2b where the fitted lines in periods 1 and 3 have negative slope, which implies that the estimate of s 2 m is negative in periods 1 and 3 (but the true value of s 2 m cannot be negative in this situation).We note that in the limit as s 2 m approaches zero, the expression for the within-group MSE reduces to that in the additive model case (and similarly for the between-group MSE).

Applying uncertainty estimates: the D statistic and one-at-a-time-verification measurements
This paper considers two possible IAEA verification tests.First, the overall D test for a pattern is based on the average difference, D ¼ N P n j¼1 ðO j À I j Þ=n.Second, the one-at-a-time test compares the operator to the corresponding inspector measurement for each item and a relative difference is computed, defined as IS (or some other alarm threshold close to the value of 3 that corresponds to a small false alarm probability), then the jth item selected for verification leads to an alarm.Note that the correct normalization used to define the relative difference is actually d j = (o j À i j )/m j , which has standard deviation exactly d.But m j is not known in practice, so a reasonable approximation is to use d j = (o j À i j )/o j , because the operator measurement o j is typically more accurate and precise than the inspectors's 3.1 The D statistic to test for a trend in the individual differences For an additive error model, SI are the absolute (not relative) variances.If one were sampling from a finite population without measurement error to estimate a population mean, then s 2 D ¼ N 2 ðs 2 =nÞððN À nÞ=NÞ; where f = (N À n)/N is the finite population correction factor, and s 2 is a quasivariance term (the "item variance" as defined previously in a slightly different context), defined here as operator measurement by group, with linear fits in each group.
when there are both random and systematic measurement errors.And, the fact that s 2 D ¼ N 2 ðs 2 =nÞf ¼ 0 when n = N and there are no measurement errors is also easily explainable.
For a multiplicative error model (our focus), it can be shown [11] that where , so that the average variance in the multiplicative model is the same as the variance in the additive model for both random and systematic errors.Assume d R = 0.10, d S = 0.02, m ¼ 100 (arbitrary units), and s 2 m ¼ 2500 (50% relative standard deviation in the true values).Then the additive model has s D = 270.8and the corresponding multiplicative model with the same average absolute variance has s D = 310.2, a 15% increase.The fact that var(m) contributes to s 2 D in a multiplicative model has an implication for sample size calculations such as those we describe in Section 4. Provided the magnitude of S Iij þ R Iij is approximately 0.2 or less (equivalently, the relative standard deviation of S Iij þ R Iij should be approximately 8% or less), one can convert equation ( 1) to an additive model by taking logarithms, using the approximation log (1 þ x) ≈ x for |x| 0.20.However, there are many situations for which the log transform will not be sufficiently accurate, so this paper describes a recently developed option to accommodate multiplicative models rather than using approximations based on the logarithm transform [4,5].
The overall D test for a pattern is based on the average difference, D ¼ N P n j¼1 ðO j À I j Þ=n.The D-statistic test is based on equation (2), where , and s 2 m needed in equation ( 2).The second sample size n 2 is the number of operator's declared measurements randomly selected for verification by the inspector.The sample size n 1 consists of two sample sizes: the number of groups g (inspection periods) used to estimate d 2 S and the total number of items over all groups, n 1 = gn in the case (the only case we consider in examples in Sect.4) that each group has n paired measurements.

One-at-a-time sample verification tests
The IAEA has historically used zero-defect sampling, which means that the only acceptable (passing) sample is one for which no defects are found.Therefore, the non-detection probability is the probability that no defects are found in a sample of size n when one or more true defective items are in the population of size N.For one-item-at-a-time testing, the non-detection probability is given by Probðdiscover 0 defects in sample of size nÞ where the term A i is the probability that the selected sample contains i truly defective items, which is given by the hypergeometric distribution with parameters on i, n, N, r, where i is the number of defects in the sample, n is the sample size, N is the population size, and r is the number of defective items in the population.More specifically, the above equation is the probability of choosing i defective items from r defective items in a population of size N in a sample of size n, which is the well-known hypergeometric distribution.The term B i is the probability that none of the i truly defective items is inferred to be defective based on the individual d tests.The value of B i depends on the metrology and the alarm threshold.Assuming a multiplicative error model for the inspector measurement (and similarly for the operator), implies that, for an alarm threshold of k = 3, for Dj ¼ ððO , which is given by the multivariate normal integral where each of the components of l are equal to 1 SQ/r (SQ is a significant quantity; for example, 1 SQ = 8 kg for Pu, and r was defined above as the number of defective items in the population).The term P i in the B i calculation involved in the multivariate normal integral is a square matrix with i rows and columns with values

on the diagonal and values d 2
S on the off-diagonals.

Simulation study
The left hand side of equations ( 2) and (3) can be considered a "measurand" in the language used in the guide to expressing uncertainty in measurement [12].Although the error propagation in the GUM is typically applied in a "bottom-up" uncertainty evaluation of a measurement method, it can also be applied to any other output quantity y (such as y = s D or y = DP) expressed as a known function y = f(x 1 , x 2 , . . ., x p ) of inputs x 1 , x 2 , . . ., x p (inputs such as ; and s 2 m ).The GUM recommends linear approximations ("delta method") or Monte Carlo simulations to propagate uncertainties in the inputs to predict uncertainties in the output.Here we use Monte Carlo simulations to evaluate the uncertainties in the inputs ; and s 2 m and also to evaluate the uncertainty in y = s D or y = DP as a function of the uncertainties in the inputs.Notice that equation ( 2) is linear in d 2 R and d 2 S ; so the delta method to approximate the uncertainty in y = s D would be exact; however, there is non-zero covariance (a negative covariance) between d2 R and d2 S that would need to be taken into account in the delta method.
We used the statistical programming language R [13]  IS unless s 2 m is nearly 0, we also present results for a modified Grubbs' estimator applied to the relative differences Dj ¼ ðO j À I j Þ=O j that estimates the aggregated variances , and also estimates s 2 m .Results are described in Sections 4.1 and 4.2.

4.1
The D statistic to test for a trend in the individual differences d j = (o j À i j )/o j Figure 3 plots 95% CIs for s D versus sample size n 2 using the modified Grubbs' estimator applied to the relative differences Dj ¼ ðO j À I j Þ=O j for the parameter values d RO = 0.01, d SO = 0.001, d RI = 0.05, d SI = 0.005, m ¼ 1, s m = 0.01, N = 200 for case A (defined here and throughout as n 1 = 4 with g = 2, n = 2) and for case B (defined here and throughout as n 1 = 50 with g = 5, n = 10) .We used 10 5 simulations of the measurement process to estimate the quantiles of the distribution of y = s D .We confirmed by repeating the sets of 10 5 simulations that simulation error due to using a finite number of simulations is negligible.Clearly, and not surprisingly, the sample size in Case A leads to CI length that seems to be too wide for effectively quantifying the uncertainty in s D .The traditional Grubbs' estimator performs poorly unless s m is very small, such as s m = 0.0001.We use the traditional Grubbs' Figure 4 is similar to Figure 3, except Figure 4 plots the length of 95% CIs for 6 possible values of n 1 (see the figure legend).Again, the case A sample size is probably too small for effective estimation of s D .In this example, the smallest length CI is for g = 5 and n = 100, but n = 100 is unrealistically large, while g = 3 and n = 10 or g = 5 and n = 10 are typically possible with reasonable resources.The length of these 95% CIs is one criterion to choose an effective sample size n 1 .
Another criterion to choose an effective sample size n 1 is the root mean squared error (RMSE, defined below) in estimating the sample size n 2 needed to achieve s D = 8/3.3(the 3.3 is an example value that corresponds to a 95% DP to detect an 8 kg shift (1 SQ for Pu) while maintaining a 0.05 FAP when testing for material loss).In this example, the RMSE in estimating the sample size n 2 needed to achieve s D = 8/3.3 is approximately 12.9 for case A and 8.0, 7.3, 6.8, 6.7, and 6.3, respectively, for the other values of n 1 considered in Figure 4.These RMSEs are repeatable to within ±0.1 across sets of 10 5 simulations so the RMSE values are in the same order as the CI lengths in Figure 4.The RMSE is defined as where n ^2,i is the estimated sample size n 2 in simulation i that is needed in order to achieve s D = 8/3.3,and n 2,true is the true sample size n 2 (n 2,true = 22 in this example; see Fig. 3 where the true value of s D versus n 2 is also shown) needed to achieve s D = 8/3.3.
Another criterion to choose an effective size n 1 is the detection probability to detect specified loss scenarios.We consider this criterion in Section 4.3.

Uncertainty on the uncertainty on the uncertainty
The term "uncertainty" typically refers to a measurement error standard deviation, such as s D .Therefore, Figures 3  and 4 involve the "uncertainty of the uncertainty" as a function of n 1 (defined as n 1 = ng, so more correctly, as a function of g and n) and n 2 .Figures 5-7 illustrate the "uncertainty of the uncertainty of the uncertainty" (we commit to stopping at this level-three usage of "uncertainty").The "uncertainty of the uncertainty" depends on the underlying measurement error probability density, which is sometimes itself uncertain.Figure 5 plots the familiar normal density and three non-normal densities (uniform, gamma, and generalized lambda, [14]).Figure 6 plots the estimated probability density (using the 10 5 realizations) of the estimated value of d IR using the traditional Grubbs' estimator for each of the four distributions (the true value of d IR is 0.05) and the five true standard deviations are the same as in Section 4.1 for generating the random variables (d RO = 0.01, d SO = 0.001, d RI = 0.05, d SI = 0.005, m ¼ 1, s m = 0.01, N = 200).Figure 7 is similar to Figure 3 (for g = 5, n = 10), except it compares CIs assuming the normal distribution to CIs assuming the generalized lambda distribution.That is, Figure 7 plots the estimated CI, again for the model parameters as above, for s D for the normal and for the generalized lambda distributions.In this case, the CIs are wider for the generalized lambda distribution than for the normal distribution.Recall (Fig. 5) that standard deviation of the four estimated probability densities are: 0.14, 0.25, 0.10, and 0.36 for the normal, gamma, uniform, and generalized lambda, respectively.Therefore, one might expect the CI for s D to be shorter for the normal than for a generalized lambda distribution that has the same relative standard deviation as the corresponding normal distribution.

One-at-a-time testing
For one-at-a-time testing, Figure 8      falsified items, and the amount falsified per item) lead to different conclusions about uncertainty as a function of n 2 in how the DP decreases as a function of n 2 .For example, if we reduce m ¼ 15 to m ¼ 1 in this example, then the confidence interval lengths are very short for both case A and case B.
For this same example, we can also compute the DP in using the D statistic to detect the loss (which the operator attempts to mask by falsifying the data).For the example just described (for which simulation results are shown in Fig. 8), the true DP in using the D statistic (using an alarm threshold of s D and n 2 = 30 using Eq. ( 2)) is 0.65.The corresponding true DP for one-at-a-time testing is 0.27.Therefore, in this example, with 10 of 200 items falsified, each by an amount of 8 units, the D statistic has lower DP than the n 2 = 30 one-at-a-time tests.In other examples, the D statistic will have higher DP, particularly when there are many falsified items in the population.For example, if we increase the number of defectives in this example from 10 of 200 to 20, 30, or 40 of 200, then the DPs are (0.17, 0.17), (0.08, 0.15), and (0.06, 0.14) for one-at-a-time testing and for the D statistic, respectively.These are low DPs, largely because the measurement error variances are large in this example.One can also assess the sensitivity of the estimated DP in using the D statistic to the uncertainty in the estimated variances; for brevity, we do not show that here.

Discussion and summary
This study was motivated by three considerations.First, there is an ongoing need to improve UQ for error variance estimation.For example, some applications involve characterizing items for long-term storage and the measurement error behaviour for the items is not well known, so an initial metrology study with to-be-determined sample sizes is required.Second, we recently provided the capability to allow for multiplicative error models in evaluating the D statistic (Eq.( 2)) [4,5].Third, we recently provided the capability to allow for both random and systematic errors in one-at-a-time item testing (Eq.( 3)).
We presented a simulation study that assumed error variances are estimated using an initial metrology study characterized by g measurement groups and n paired operator, inspector measurements per group.Not surprisingly, both one-item-at-a-time testing and pattern testing using the D statistic, it appears that g = 2 and n = 2 is too small for effective variance estimation.
Therefore, the sample sizes in the previous and current inspections will impact the estimated DP and FAP, as is illustrated by numerical examples.The numerical examples include application of the new expression for the variance of the D statistic assuming the measurement error model is multiplicative (Eq.( 2)) is used in a simulation study and new application of both random and systematic error variances in one-item-at-a-time testing (Eq.( 3)).Future work will evaluate the impact of larger values of product variability, s 2 m on the standard Grubbs' estimator; this study used a very small value of s 2 m , which is adequate in some contexts, such as product streams.The value of s 2 m could be considerably larger in some NM streams, particularly waste streams.Therefore, this study also evaluated the relative differences d j = (o j À i j )/o j to estimate the aggregated quantities needed in equations ( 2) and , using a modified Grubbs' estimation, to mitigate the impact of noise in estimation of s m .Because s 2 m is a source of noise in estimating the individual measurement error variances [15], a Bayesian alternative is under investigation to reduce its impact [16].Also, one could base a statistical test for data falsification based on the relative differences between operator and inspector measurements d = (o À i)/o in which case an alternate expression to equation (2) for s D that does not involve the product variability s 2 m would be used.

Implications and influences
This study was motivated by three considerations, each of which have implications for future work.First, there is an ongoing need to improve UQ for error variance estimation.For example, some applications involve characterizing items for long-term storage and the measurement error behaviour might not be well known for the items, so an initial metrology study with to-be-determined sample sizes is required.Second, we recently provided the capability to allow for multiplicative error models in evaluating the D statistic (Eq.(2) in Sect.3) [4,5].Third, we recently provided the capability to allow for both random and systematic errors in one-at-a-time item testing (Eq.(3) in Sect.3).Previous to this work, the variance of the D statistic was estimated by assuming measurement error models are additive rather than multiplicative, and one-ata-time item testing assumed that all measurement errors were purely random.

Fig. 1 .
Fig. 1.Example simulated verification measurement data.The relative difference d ~= (o À i)/o is plotted for each of 10 paired (o,i) measurements in each of 5 groups, for a total of 50 relative differences.The mean relative difference within each group (inspection period) is indicated by a horizontal line through the respective group means of the paired differences.

Fig. 2 .
Fig. 2. Example real verification measurement data.(a) Four paired (O,I) measurements in three inspection periods; (b) inspector vs. operator measurement by group, with linear fits in each group.
plots 95% confidence intervals for the estimated DP versus sample size n 2 for cases A and B (see Sect. 4.1).The true parameter values used in equation (3) were d RO = 0.1, d SO = 0.05, d RI = 0.1,

Fig. 6 .
Fig. 6.The estimated probability density for d ^IR in the four example measurement error probability densities (normal, gamma, uniform, and generalized lambda, each with mean 0 and variance 1) from Figure 4.

Fig. 7 .
Fig. 7. 95% confidence intervals for the estimate of s D versus sample size n 2 for case B, assuming the measurement error distribution is either the normal or the generalized lambda distribution.

Fig. 8 .
Fig. 8.Estimated detection probability and 95% confidence interval versus sample size n 2 for cases A and B. The true detection probability is plotted as the solid (black) line.
mÞ 2 Þ= ðN À 1Þ, and so to calculate s 2 D in equation (2), one needs to know or assume values for s 2 m (the item variance) and the average of the true values, m.In equation (2), the first two terms are analogous toN 2 ððs 2 R =nÞ þ s 2 S Þin the additive error model case.The third term involves s 2 m and decreases to 0 when n = N.Again, in the limit as s 2 m approaches zero, equation (2) reduces to that for the additive model case; and regardless whether s 2 m is large or near zero, the effect of d 2 Scannot be reduced by taking more measurements (increasing n in Eq. (2)).In general, the multiplicative error model gives different results than an additive error model because variation in the true values, s 2 m , contributes to s 2 D in a multiplicative model, but not in an additive model.For example, let and s 2 m is the absolute variance of the true (unknown) values.If the observed D value exceeds 3s D (or some similar multiple of s D to achieve a lot false alarm probability) then the D test alarms.The test that alarms if D ≥ 3s D is actually testing whether D ≥ 3s ^D, where s ^D denotes an estimate of s D ; this leads to two sample size evaluations.The first sample size n 1 involves metrology data collected in previous inspection samples used to estimate d