Estimating model bias over the complete nuclide chart with sparse Gaussian processes at the example of INCL/ABLA and double-differential neutron spectra

Predictions of nuclear models guide the design of nuclear facilities to ensure their safe and efficient operation. Because nuclear models often do not perfectly reproduce available experimental data, decisions based on their predictions may not be optimal. Awareness about systematic deviations between models and experimental data helps to alleviate this problem. This paper shows how a sparse approximation to Gaussian processes can be used to estimate the model bias over the complete nuclide chart at the example of inclusive double-differential neutron spectra for incident protons above 100\,MeV. A powerful feature of the presented approach is the ability to predict the model bias for energies, angles, and isotopes where data are missing. The number of experimental data points that can be taken into account is at least in the order of magnitude of~$10^4$ thanks to the sparse approximation. The approach is applied to the Li\`ege Intranuclear Cascade Model (INCL) coupled to the evaporation code ABLA. The results suggest that sparse Gaussian process regression is a viable candidate to perform global and quantitative assessments of models. Limitations of a philosophical nature of this (and any other) approach are also discussed.


Introduction
Despite theoretical advances, nuclear models are in general not able to reproduce all features of trustworthy experimental data. Because experiments alone do not provide sufficient information to solve problems of nuclear engineering, models are still needed to fill the gaps.
Being in need of reliable nuclear data, a pragmatic solution to deal with imperfect models is the introduction of a bias term on top of the model. The true prediction is then given as the sum of model prediction and bias term. The form of the bias term is in principle arbitrary and a low order polynomial could be a possible choice. However, assumptions about how models deviate from reality are usually very vague, which makes it difficult to justify one parametrization over another one.
Methods of non-parametric statistics help to a certain extent to overcome this specification problem. In particular, Gaussian process (GP) regression (also known as Kriging, e.g., [1]) enjoys popularity in various fields, such as geostatistics, remote sensing and robotics, due to its conceptual simplicity and sound embedding in Bayesian statistics. Instead of providing a parameterized function to fit, one specifies a mean function and a covariance function.
The definition of these two quantities induces a prior probability distribution on a function space. Several standard specifications of parametrized covariance functions exist, e.g., [1,Chap. 4], whose parameters regulate the smoothness and the magnitude of variation of the function to be determined by GP regression.
The optimal choice of the values for these so-called hyperparameters is problem dependent and can also be automatically performed by methods such as marginal likelihoodoptimizationandcross validation,e.g., [1,Chap.5]. In addition, features of various covariance functions can be combined by summing and multiplying them. Both the automatic determination of hyperparameters and the combination of covariance functions will be demonstrated.
From an abstract viewpoint, GP regression is a method to learn a functional relationship based on samples of input-output associations without the need to specify a functional shape. GP regression naturally yields besides estimates also the associated uncertainties. This feature is essential for evaluating nuclear data because uncertainties of estimates are as important for the design of nuclear facilities as are the estimates themselves. Prior knowledge about the smoothness and the magnitude of the model bias can be taken into account. Furthermore, many existing nuclear data evaluation methods, e.g., [2][3][4], can be regarded as special cases. This suggests that the approach discussed in this paper can be combined with existing evaluation methods in a principled way.
The main hurdle for applying GP regression is the bad scalability in the number of data points N. The required inversion of an N Â N covariance matrix leads to a computational complexity of N 3 . This limits the application of standard GP regression to several thousand data points on contemporary desktop computers. Scaling up GP regression to large datasets is therefore a field of active research. Approaches usually rely on a combination of parallel computing and the introduction of sparsity in the covariance matrix, which means to replace the original covariance matrix by a low rank approximation, see e.g., [5].
In this paper, I investigate the sparse approximation introduced in [6] to estimate the model bias of inclusive double-differential neutron spectra over the complete nuclide chart for incident protons above 100 MeV. The predictions are computed by the C++ version of the Liège intranuclear cascade model (INCL) [7,8] coupled to the Fortran version of the evaporation code ABLA07 [9]. The experimental data are taken from the EXFOR database [10].
The idea of using Gaussian processes to capture deficiencies of a model exists for a long time in the literature, see e.g., [11], and has also already been studied in the context of nuclear data evaluation, e.g., [12][13][14]. The novelty of this contribution is the application of GP regression to a large dataset with isotopes across the nuclide chart, which is possible thanks to the sparse approximation. Furthermore, the way GP regression is applied enables predictions for isotopes without any data.
The exemplary application of sparse GP regression in this paper indicates that the inclusion of hundred thousands of data points may be feasible and isotope extrapolations yield reasonable results if some conditions are met. Therefore, sparse GP regression is a promising candidate to perform global assessments of models and to quantify their imperfections in a principled way.
The structure of this paper is as follows. Section 2 outlines the theory underlying sparse GP regression. In particular, Section 2.1 provides a succinct exposition of standard GP regression, Section 2.2 sketches how to construct a sparse approximation to a GP and how this approximation is exploited in the computation, and Section 2.3 explains the principle to adjust the hyperparameters of the covariance function based on the data.
The application to INCL/ABLA and inclusive doubledifferential neutron spectra is then discussed in Section 3. After a brief introduction of INCL and ABLA in Section 3.1, the specific choice of covariance function is detailed in Section 3.2. Some details about the hyperparameter adjustment are given in Section 3.3 and the results of GP regression are shown and discussed in Section 3.4.

Method
2.1 GP regression GP regression, e.g., [1], can be derived in the framework of Bayesian statistics under the assumption that all probability distributions are multivariate normal. Let the vector y 1 contain the values at the locations fx p i g of interest and y 2 the observed values at the locations fx o j g.
For instance, in the application in Section 3, the elements in the vector y 1 represent the relative deviations of the "truth" from the model predictions for neutron spectra at angles and energies of interest. The vector y 2 contains the relative deviations of the available experimental data from the model predictions for neutron spectra at the angles and energies of the experiments. The underlying assumption is that the experimental measurements differ from the truth by an amount compatible with their associated uncertainties. If this assumption does not hold, model bias should be rather understood as a combination of model and experimental bias.
Given a probabilistic relationship between the vectors y 1 and y 2 , i.e. we know the conditional probability density function (pdf) of y 2 given y 1 , r(y 2 |y 1 ) (e.g., because of continuity assumptions), the application of the Bayesian update formula yields rðy 1 jy 2 Þ∝rðy 2 jy 1 Þrðy 1 Þ: ð1Þ The occurring pdfs are referred to as posterior rðy 1 jy 2 Þ, likelihood rðy 2 jy 1 Þ, and prior rðy 1 Þ. The posterior pdf represents an improved state of knowledge.
The form of the pdfs in the Bayesian update formula can be derived from the joint distribution rðy 1 ; y 2 Þ. In the following, we need the multivariate normal distribution which is characterized by the center vector m and the covariance matrix K. The dimension of the occurring vectors is denoted by N. Under the assumption that all pdfs are multivariate normal and centered at zero, the joint distribution of y 1 and y 2 can be written as The compound covariance matrix contains the blocks K 11 and K 22 associated with y 1 and y 2 , respectively. The block K 21 contains the covariances between the elements of y 1 and y 2 . Centering the multivariate normal pdf at zero is a reasonable choice for the estimation of model bias. It means that an unbiased model is a priori regarded as the most likely option.
The posterior pdf is related to the joint distribution by The solution for a multivariate normal pdf is another multivariate normal pdf. For equation (3), the result is given by with the posterior mean vector y 0 1 and covariance matrix K' 11 (e.g., [1, A.3]), The important property of these equations is the fact that the posterior moments depend only on the observed vector y 2 . The introduction of new elements into y 1 and associated columns and rows into K 11 in equation (3) has no impact on the already existing values in y 0 1 and K' 11 . In other words, one is not obliged to calculate all posterior expectations and covariances at once. They can be calculated sequentially.
GP regression is a method to learn a functional relationship f ðxÞ based on samples of input-output associations {(x 1 , f 1 ), (x 2 , f 2 ), … }. The vector y 2 introduced above then contains the observed functions values of f ðxÞ, i.e. y 2 ¼ ðf 1 ; f 2 . . .Þ T . Assuming all prior expectations to be zero, the missing information to evaluate equations (6) and (7) are the covariance matrices. Because predictions for f ðxÞ should be computable at all possible locations x and the same applies to observations, covariances between functions values must be available for all possible pairs ðx i ; x j Þ of locations. This requirement can be met by the introduction of a so-called covariance function kðx i ; x j Þ. A popular choice is the squared exponential covariance function The parameter d enables the incorporation of prior knowledge about the range functions values are expected to span. The parameter l regulates the smoothness of the solution. The larger l the slower the covariance function decays for increasing distance between x 1 and x 2 and consequently the more similar function values are at nearby locations.
If the set fx p i g contains the locations of interest and fx o j g the observed locations, the required covariance matrices in equations (6) and (7) are given by and

Sparse Gaussian processes
The main hurdle for applying GP regression using many observed pairs (x i , f i ) is the inversion of the covariance matrix k 22 in equations (6) and (7). The time to invert this N Â N matrix with N being the number of observations increases proportional to N 3 . This limits the application of standard GP regression to several thousand observations on contemporary desktop computers. Parallelization helps to a certain extent to push this limit. Another measure is the approximation of the covariance matrix by a low rank approximation. I adopted the sparse approximation described in [6], which will be briefly outlined here. For a survey of different approaches and their connections consult e.g., [5]. Suppose that we have not measured the values in y 2 associated with the locations fx o j g, but instead a vector y 3 associated with some other locations fx psi k g. We refer to y 3 as vector of pseudo-inputs. Now we use equation (6) to determine the hypothetical posterior expectation of y 2 , The matrices K 23 and K 33 are constructed analogous to equations (10) and (9), respectively, i.e. ðK 23 . M and N being the number of observations and M the number of pseudo-inputs. Noteworthy, y' 2 is a linear function of y 3 . Under the assumption of a deterministic relationship, we can replace the posterior expectation y' 2 by y 2 . Using the sandwich formula and rðy 3 Þ ¼ N ðy 3 j0; K 33 Þ, we get for the covariance matrix of y 2K Given that both K 33 and K 23 have full rank and the number of observations N is bigger than the number of pseudo-inputs M, the rank ofK 22 equals M. The approximation is more rigid than the original covariance matrix due to the lower rank.
In order to restore the flexibility of the original covariance matrix, the diagonal matrix D 2 ¼ diag½K 22 ÀK 22 is added toK 22 . This correction is essential for the determination of the pseudo-input locations via marginal likelihood optimization as explained in [6,Sect. 3]. Furthermore, to make the approximation exhibit all properties of a GP, it is also necessary to add Making the replacements 33 K 31 , and K 11 →K 11 þ D 1 in equations (6) and (7), we obtain Using the Woodbury matrix identity (e.g., [1, A. 3]), these formulas can be rewritten as (e.g., [5, Sect. 6]), Noteworthy, the inversion in these expressions needs only to be performed for an M Â M matrix where M is the number of pseudo-inputs. Typically, M is chosen to be in the order of magnitude of hundred. The computational cost for inverting the diagonal matrix D is negligible.
Using equations (15) and (16), the computational complexity scales linearly with the number of observations N due to the multiplication by K 23 . This feature enables to increase the number of observations by one or two orders of magnitude compared to equations (6) and (7) in typical scenarios.

Marginal likelihood maximization
Covariance functions depend on parameters (called hyperparamters) whose values must be specified. In a full Bayesian treatment, the choice of values should not be informed by the same observations that enter into the GP regression afterwards. In practice, this ideal is often difficult to achieve due to the scarcity of available data and therefore frequently abandoned. A full Bayesian treatment may also become computationally intractable if there are too much data.
Two popular approaches to determine the hyperparameters based on the data are marginal likelihood maximization and cross validation, see e.g., [1,Chap. 5]. A general statement which approach performs better cannot be made. I decided to use marginal likelihood optimization because it can probably be easier interfaced with existing nuclear data evaluation methods.
Given a covariance function depending on some hyperparameters, e.g., k(x i , x j |d, l) with hyperparameters d and l as in equation (8), the idea of marginal likelihood maximization is to select values for the hyperparameters that maximize the probability density for the observation vector r(y 2 ). In the case of the multivariate normal pdf in equation (3), it is given by (e.g., [1, Sect. 5.4.1]) The first term is a constant, the second term is up to a constant the information entropy of the multivariate normal distribution, and the third term is the generalized x 2 -value. The maximization of this expression amounts to balancing two objectives: minimizing the information entropy and maximizing the x 2 -value.
The partial derivative of equation (17) with respect to a hyperparameter is given by (e.g., [1, Sect. 5.4.1]) which enables the usage of gradient-based optimization algorithms. In this paper, I use the L-BFGS-B algorithm [15] because it can deal with a large number of parameters and allows to impose restrictions on their ranges. Due to the appearance of the determinant and the inverse of K 22 , the optimization is limited to several thousand observations on contemporary desktop computers. However, replacing K 22 by the approximatioñ (17) and (18) enables to scale up the number of observations by one or two orders of magnitude. The structure of the approximation is exploited by making use of the matrix determinant lemma, the Woodbury identity, and the trace being invariant under cyclic permutations.
The approximation to K 22 is not only determined by the hyperparameters but also by the location of the pseudoinputs {x psi k }. Hyperparameters and pseudo-input locations can be jointly adjusted by marginal likelihood maximization. The number of pseudo-inputs is usually significantly larger (e.g., hundreds) than the number of hyperparameters (e.g., dozens). In addition, the pseudo-inputs are points in a potentially multi-dimensional space and their specification requires a coordinate value for each axis. For instance, in Section 3 sparse GP regression is performed in a five dimensional space with three hundred pseudo-inputs, which gives 1500 associated parameters. Because equation (18) has to be evaluated for each parameter in each iteration of the optimization algorithm, its efficient computation is important.
The mathematical details are technical and tedious and hence only the key ingredient for efficient computation will be discussed. Let x kl be the lth coordinate of the kth pseudoinput. The crucial observation is that ∂K 33 /∂ x kl yields a matrix in which only the lth and kth column and row contain non-zero elements. A similar statement holds for ∂K 23 /∂ x kl . This feature can be exploited in the multiplications and the trace computation in equation (18) (where K 22 is substituted byK 22 þ D 2 ) to achieve OðNMÞ per coordinate of a pseudo-input with M being the number of pseudo-inputs, and N being the number of observations. This is much more efficient than OðdNM 2 Þ for the partial derivative with respect to a hyperparameter.

Scenario
The model bias was determined for the C++ version of the Liège INCL [7], a Monte Carlo code, coupled to the evaporation code ABLA07 [9] because this model combination performs very well according to an IAEA benchmark of spallation models [16,17] and is used in transport codes such as MCNPX and GEANT4. This suggests that the dissemination of a more quantitative performance assessment of INCL coupled to ABLA potentially helps many people to make better informed decisions. Because some model ingredients in INCL are based on views of classical physics (as opposed to quantum physics), the model is mainly used for high-energy reactions above 100 MeV.
The ability of a model to accurately predict the production of neutrons and their kinematic properties may be regarded as one of the most essential features for nuclear engineering applications. Especially for the development of the innovative research reactor MYRRHA [18] driven by a proton accelerator, these quantities need to be well predicted for incident protons.
For these reasons, I applied the approach to determine the model bias in the prediction of inclusive doubledifferential neutron spectra for incident protons and included almost all nuclei for which I found data in the EXFOR database [10]. Roughly ten thousand data points were taken into account. Table 1 gives an overview of the data.

Design of the covariance function
The covariance function presented in equation (8) is probably too restrictive to be directly used on doubledifferential spectra. It incorporates the assumption that the model bias spans about the same range for low and high emission energies. Because the neutron spectrum quickly Table 1. Summary of the double-differential (p,X)n data above 100 MeV incident energy found in EXFOR [10]. No mass number is appended to the isotope name in the case of natural composition. Columns are: incident proton energy (En), range of emitted neutron energy (E min , E max ), range of emission angles (u min , u max ), and number of data points (NumPts). Energy related columns are in MeV and angles in degree. The total number of data points is 9287. declines by orders of magnitude with increasing emission energy, it is reasonable to use a covariance function with more flexibility to disentangle the systematics of the model bias associated with these two energy domains. Assuming the two covariance functions k 1 (x i , x j ) and k 2 (x i , x j ), a more flexible covariance function can be constructed in the following ways (e.g., [1,Sect. 4 Taking into account that the purpose of a covariance function is to compute elements of a covariance matrix, the construction in equation (19) is analogous to the possible construction of an experimental covariance matrix: An experimental covariance matrix can be assembled by adding a diagonal covariance matrix reflecting statistical uncertainties to another one with systematic error contributions. Please note that sparse GP regression as presented in this paper works only with a diagonal experimental covariance matrix. Equation (20) will be used to achieve a transition between the low and high energy domains.
To state the full covariance function used in this paper, the following covariance function on a one-dimensional input space is introduced, The meaning of the hyperparameter l was explained below equation (8).
The inclusive double-differential neutron spectra for incident protons over the complete nuclide chart can be thought of as a function of spectrum values associated with points in a five dimensional input space. The coordinate axes are incident energy (En), mass number (A), charge number (Z), emission angle (u) and emission energy (E). Using equation (21) we define two covariance functions k L and k H associated with the low and high energy domain of the emitted neutrons. Given two input vectors the form of the covariance function for both k L and k H is assumed to be The hyperparameters l C x are for the coordinate axis indicated by x ∈ {En, A, …} and can take different values for k L and k H (C ∈ {L, H}).
The transition between the two energy domains, which means to switch from k L to k H , is established by the logistic function Noteworthy, all the variables are vectors. The equation kðx À x 0 Þ ¼ 0 defines a (hyper)plane with normal vector k and distance jkx 0 j=jkj to the origin of the coordinate system. The function in equation (25) attains values close to zero for x far away from the plane on one side and close to one if far away on the other side. Within which distance to the plane the transition from zero to one occurs depends on the length of k. The larger |k|, the faster the transition and the narrower the window of transition around the plane.
With the abbreviations the full covariance function k full is given by Finally, the GP regression is not performed on the absolute difference D between a model prediction s mod and experimental data point s exp , but on the transformed quantityD ¼ ðs exp À s mod Þ maxðs mod ; 0:1Þ : ð30Þ In words, relative differences are taken for model predictions larger than 0.1 and absolute differences scaled up by a factor of ten for model predictions below 0.1. Relative differences fluctuate usually wildly for spectrum values close to zero-especially for a Monte Carlo code such as INCL-and the switch to absolute values helps GP regression to find more meaningful solutions with a better ability to extrapolate.
Due to the number of roughly ten thousand data points, the covariance matrices computed with equation (29) were replaced by the sparse approximation outlined in Section 2.2. I introduced three hundred pseudo-inputs and placed them randomly at the locations of the experimental data. Their locations were then jointly optimized with the hyperparameters, which will be discussed in Section 3.3.
The diagonal matrix D 2 occurring in the approximation was changed toD to accommodate statistical uncertainties of the model prediction (due to INCL being a Monte Carlo code) and the experimental data. Both B and P are diagonal matrices. The matrix B contains variances corresponding to 10% statistical uncertainty for all experimental data points. The matrix P contains the estimated variances of the model predictions.
A diagonal matrix for the experimental covariance matrix B can certainly be challenged because the important systematic errors of experiments reflected in off-diagonal elements are neglected. This is at the moment a limitation of the approach.

Marginal likelihood maximization
The hyperparameters appearing in equation (29) and the locations of the three hundred pseudo-inputs {x psi k } determining the approximation in equation (11) were adjusted via marginal likelihood maximization described in Section 2.3. To be explicit, the hyperparameters considered were d L , d H , and also x 0 and k of the logistic function. The vector k was forced to remain parallel to the axes associated with En, A, and Z, i.e. k ¼ ð0; 0; 0; k u ; k E Þ T . Further, polar coordinates k u = k c sing, k E = k c cosg were introduced. The direction of the vector x 0 was taken equal to that of k, which removes ambiguity in the plane specification without shrinking the set of possible solutions. Because of these measures, it sufficed to consider the length x c = |x 0 | as hyperparameter.
Counting both hyperparameters and pseudo-inputs, 1515 parameters were taken into account in the optimization. I employed the L-BFGS-B algorithm [15] as implemented in the optim function of R [19], which makes use of an analytic gradient, can deal with a large number of variables and permits the specification of range restrictions. The optimization was performed on a cluster using 25 cores and was stopped after 3500 iterations, which took about 10 h.
The obtained solution corresponds to x 2 /N = 1.03 and is with a two-sided p-value of 0.04 reasonably consistent in a statistical sense. Restrictions of parameter ranges were established to introduce prior knowledge and to guide the optimization procedure. Noteworthy, lower limits on length-scales, such as l L u and l L E ; were introduced to counteract their dramatic reduction due to inconsistent experimental data in the same energy/angle range. Table 2 summarizes the optimization procedure. The evolution of the pseudo-inputs projected onto the (A,Z)-plane is visualized in Figure 1.
A thorough study of the optimization process exceeds the scope of this paper and hence I content myself with a few remarks. The length scales associated with the emission angle, i.e. l C u , experienced significant changes. Their increase means that the model bias is similar for emission angles far away from each other and the GP process is able to capture this similarity. Concerning the length scales associated with the emission energy, the small value l L E ¼ 4:8 compared to the larger value l H E ¼ 43 indicates that the features of the model bias of the low and high energy domain are indeed different. The most striking feature, however, is that the large length scales l C A and l C Z "survived" the optimization, which means that the model bias behaves similar over large regions of the nuclide charts. Examples of isotope extrapolations will be given in Section 3.4. As a final remark, a more rigorous study of the optimization procedure is certainly necessary and there is room for improvement. This is left as future work.

Results and discussion
The values of the hyperparameters and pseudo-inputs obtained by marginal likelihood maximization were used in the covariance function in equation (29). This covariance function was employed to compute the required covariance matrices in equations (15) and (16) based on the experimental data summarized in Table 1. Because the hyperparameters are determined during hyperparameter optimization before being used in the GP regression, they will be referred to as prior knowledge.
Equations (15) and (16) Figure 2 serve as the basis to discuss general features of GP regression, the underlying assumptions, and its accuracy and validity.
How well we can interpolate between data and how far we can extrapolate beyond the data depends on the suitability of the covariance function for the problem at hand. The building block for the full covariance function in equation (29) is the one-dimensional covariance function in equation (21). Using the latter imposes the assumption that possible solutions have derivatives of any order and hence are very smooth [1,Chap.4]. Interpolations between data points included in the regression are determined by this smoothness property and values of the length scales l C x . The length scales reflect the prior assumption about similarity between spectrum values of points a certain distance away from each other. This prior assumption directly impacts the uncertainty of the predictions. The farther away a prediction is from an observation point, the higher the associated uncertainty. If a prediction is already multiples of any length scale away from all observations, the uncertainty reverts to its prior value given by either d L or d H depending on the energy domain.
In the case of the sparse approximation, the uncertainty is related to the distance to the pseudo-inputs. Because only few pseudo-inputs are located at very high and very low emission energies, the 2s uncertainty bands in Figure 2 in those energy domains are rather large despite the presence of experimental data.
The important finding in this specific application is that the length scales related to emission angle, l C E , mass number, l C A , and charge number, l C E , are very large. GP regression is therefore able to interpolate and extrapolate over large ranges of these axes.
The interpolation of spectrum values between angles and emission energies of an isotope with available data may be considered rather standard. For instance, one can do a x 2 -fit of a low order Legendre polynomial to the angular distributions for emission energies with data. The coefficients of the Legendre polynomial for intermediate emission energies without data can then be obtained by linear interpolation.
The important difference between such a conventional fit and GP regression is the number of basis functions. Whereas their number is fixed and limited in a conventional fit, GP regression amounts to a fit with an infinite number of basis functions [1,Sect. 2.2]. The length scales regulate the number of basis functions that effectively contribute to the solution. Hyperparameter optimization decreases the length scales for unpredictable data which leads to a greater number of contributing basis functions and consequently to greater flexibility and larger uncertainties in the predictions. This feature sets GP regression apart from a standard x 2 -fit. In the latter, the uncertainty of the solution depends to a much lesser extent on the (un)predictability of the data and much more on the number of data points and the fixed number of basis functions.
One truly novel element in the approach is the inclusion of the mass and charge number, which enables predictions for isotopes without data. We can easily imagine that different isotopes differ significantly by their physical properties. From this perspective, the idea to extrapolate the model bias to other isotopes should be met with healthy skepticism.
To get a first grasp on the validity of isotope extrapolations, let us consider again the hyperparameter optimization discussed in Section 3.3. The hyperparameters were adjusted on the basis of the isotopes in Table 1. These data are spread out over the periodic table and cover a range from carbon to thorium. In these data, similar trends of the model bias persist across significant ranges of the mass and charge number, which was the reason that the associated length scales retained high values during optimization. For instance, the experimental data of carbon and indium in Figure 2 show comparable structures of the model bias despite their mass differences.
However, the isotopes considered in the optimization are not very exotic and gaps between them are at times large. Further, these isotopes are certainly not a random sample taken from the periodic table and therefore most theoretical guarantees coming from estimation theory do not hold. So how confident can we be about isotope extrapolation?
To provide a basis for the discussion of this question, Figure 2 also contains predictions for oxygen and cadmium. Importantly, the associated experimental data have not been used for the hyperparameter optimization and in the GP regression. The predictions follow well the trend of the experimental data. The supposedly 2s uncertainty bands, however, include less than 95% of the data points. This observation points out that there are systematic differences between isotopes. Therefore, due to the sample of isotopes not being a random sample, uncertainty bands should be interpreted with caution. One way to alleviate this issue could be to add a covariance function to equation (29) which only correlates spectrum values of the same isotope but not between isotopes. This measure would lead to an additional uncertainty component for each isotope, which only decreases if associated data are included in the GP regression.
A very extreme mismatch up to 500% between the prediction and experimental data occurs for oxygen at 60°a nd emission energies below 1 MeV. The creation of low energy neutrons is governed by de-excitation processes of the nucleus suggesting that the angular distribution of emitted particles is isotropic. This property holds for the model predictions but not for the experimental data. The experimental data exhibits a peak at 60°which is about a factor five higher than at 30°, 120°and 150°. The origin of this peculiarity may deserve investigation but is outside the scope of this work. For the sake of argument I assume that it is indeed a reflection of some property of the nucleus.
Because the data in Table 1 do not include any measurements below 1 MeV emission energy in this mass range, both hyperparameter optimization and GP regression had no chance to be informed about such a feature. In this case, the predictions and uncertainties are determined by nearby values orif there are not anyby the prior expectation.
If we had been aware of such effects, we could have used it as a component in the covariance function to reflect our large uncertainty about the spectrum for low emission energies. Otherwise, the only sensible data-driven way to provide predictions and uncertainties for unobserved domains is to assume that effects are similar to those in some observed domains. Of course, it is our decision which domains are considered similar. The employed covariance function in equation (29) incorporates the assumption that nearby domains in terms of mass charge, angle, etc. are similar. However, we are free to use any other assumption about similarity to construct the covariance function. As a side note, also results of other models relying on different assumptions could serve as a basis for uncertainties in unobserved regions. Such information can be included in GP regression in a principled way.

Summary and outlook
Sparse GP regression, a non-parametric estimation technique of statistics, was employed to estimate the model bias of the C++ version of the Liège INCL coupled to the evaporation code ABLA07. Specifically, the model bias in the prediction of double-differential inclusive neutron spectra over the complete nuclide chart was investigated for incident  (30) in the (p,X)n double differential spectra for 800 MeV incident protons and different isotopes as predicted by GP regression. A missing mass number behind the isotope symbol indicates natural composition. The uncertainty band of the prediction and the error bars of the experimental data denote the 2s confidence interval. Carbon and indium were taken into account in the GP regression but not cadmium and oxygen. The experimental data are colored according to the associated access number (ACCNUM) in the EXFOR database. This shows that all displayed data come from just three experiments [20][21][22].
protons above 100 MeV. Experimental data from the EXFOR database served as the basis of this assessment. Roughly ten thousand data points were taken into account. The hyperparameter optimization was done on a computing cluster whereas the GP regression itself on a desktop computer. The obtained timings indicate that increasing the number of data points by a factor of ten could be feasible.
For this specific application, it was shown that GP regression produces reasonable results for isotopes that have been included in the procedure. It was argued that the validity of predictions and uncertainties for isotopes not used in the procedure depends on the validity of the assumptions made about similarity between isotopes. As a simple benchmark, the isotopes oxygen and cadmium, which have not been taken into account in the procedure, were compared to the respective predictions. The agreement between prediction and experimental data was reasonable but the 95% confidence band sometimes misleading and should therefore be interpreted with caution. Accepting the low energy peak of oxgyen at 60°i n the data as physical reality, the low energy spectrum of oxygen served as an example where the similarity assumption between isotopes did not hold.
As for any other uncertainty quantification method, it is a hard if not impossible task to properly take into account unobserved phenomena without any systematic relationship to observed ones. The existence of such phenomena are unknown unknowns, their observation is tagged shock, surprise or discovery, and they potentially have a huge impact where they appear.
Even though GP regression cannot solve the philosophical problem associated with unknown unknowns, knowledge coming from the observation of new effects can be taken into account by modifying the covariance function. For instance, the peculiar peak in the oxygen spectrum suggests the introduction of a term in the covariance function which increases the uncertainty of the spectrum values in the low emission energy domain. The ability to counter new observations with clearly interpretable modifications of the covariance function represents a principled and transparent way of knowledge acquisition.
The scalability and the possibility to incorporate prior assumptions by modeling the covariance function makes sparse GP regression a promising candidate for global assessments of nuclear models and to quantify their uncertainties. Because the formulas of GP regression are structurally identical to those of many existing evaluation methods, GP regression should be regarded more as a complement than a substitute, which can be interfaced with existing evaluation methods.