EPJ Nuclear Sci. Technol.
Volume 11, 2025
Special Issue on ‘Overview of recent advances in HPC simulation methods for nuclear applications’, edited by Andrea Zoia, Elie Saikali, Cheikh Diop and Cyrille de Saint Jean
Article Number 53
Number of page(s) 13
DOI https://doi.org/10.1051/epjn/2025050
Published online 22 September 2025

© E. Saikali et al., Published by EDP Sciences, 2025

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

TRio-U Software for Thermal-hydraulics (TRUST) is a computational fluid dynamics (CFD) software package distributed under a BSD open-source license [1]. It has been under development since 1993 by the Energy Division (DES) of the French Alternative Energies and Atomic Energy Commission (CEA), and is designed around a parallel, object-oriented architecture using the C++ programming language [2].

The main open-source application derived from the platform, TrioCFD [3], can simulate a broad range of thermohydraulic problems, from turbulent single-phase flows to compressible multiphase regimes [4, 5]. The platform also supports modeling multi-species Low Mach Number flows (similar to combustion models) using its weakly compressible flow approach. While TRUST was originally developed for nuclear applications [6], its use has extended to fields such as hydrogen safety, lithium-ion battery modeling, and proton exchange membrane fuel cells (PEMFC) [7, 8].

TRUST implements various numerical schemes (including finite difference, finite volume, and finite element methods), supports multiple mesh types (see Fig. 1) and was designed from the start to be highly efficient. This flexibility allows the software to run efficiently on a variety of systems, ranging from conventional workstations to high-performance computing (HPC) environments. Reusability and simplicity remain core principles of the platform’s design.

Fig. 1.

Series of mesh types supported by the TRUST software.

To support parallel execution, TRUST leverages the METIS library [9], which partitions the computational domain into overlapping subdomains. This ensures balanced workload distribution among MPI processes and improves performance. Typically, these subdomains are evenly allocated to processors, which then communicate mostly with their direct neighbors as needed through MPI, and collectively when required. The platform is also linked to high-performance libraries such as PETSc [10] and MEDCoupling [11], reinforcing its scalability and computational efficiency.
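As an illustration of the kind of call involved (a minimal sketch with hypothetical array names, not the actual TRUST partitioning code), a cell-adjacency graph stored in CSR form can be handed to METIS_PartGraphKway to obtain one subdomain index per cell:

#include <metis.h>
#include <vector>

// Sketch: partition a cell-adjacency graph into 'nparts' balanced subdomains.
// xadj/adjncy follow the usual CSR convention of the METIS manual.
std::vector<idx_t> partition_cells(std::vector<idx_t>& xadj,   // size ncells+1
                                   std::vector<idx_t>& adjncy, // concatenated neighbor lists
                                   idx_t nparts)               // number of MPI processes
{
  idx_t ncells = static_cast<idx_t>(xadj.size()) - 1;
  idx_t ncon = 1;   // one balancing constraint (cell count)
  idx_t objval = 0; // edge-cut returned by METIS
  std::vector<idx_t> part(ncells);
  METIS_PartGraphKway(&ncells, &ncon, xadj.data(), adjncy.data(),
                      nullptr, nullptr, nullptr, // no vertex/edge weights
                      &nparts, nullptr, nullptr, nullptr,
                      &objval, part.data());
  return part; // part[i] = subdomain owning cell i
}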

Input and output operations are parallelized using the HDF5 library [12], allowing the use of both shared and distributed file formats. Among these, the CFD General Notation System (CGNS) [13], which is built on HDF5, is supported for post-processing workflows.

TRUST is capable of managing very large simulations by employing 64-bit integer indexing, enabling the handling of domains with more than roughly 250 million cells; this limit depends on the mesh type and is determined by the element-to-node connectivity table. For example, when using a hexahedral mesh the limit is 2²⁸ = 268,435,456 cells. One of the largest simulations to date used TRUST to simulate 2 billion cells across 50,000 MPI processes [14] (see Sect. 3).
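The hexahedral figure can be recovered with a quick back-of-the-envelope check (our own reading, not spelled out above): a signed 32-bit connectivity table addresses at most $2^{31}$ entries and each hexahedron stores eight node indices, hence

\[
  N_{\max} = \frac{2^{31}}{8} = 2^{28} = 268\,435\,456 \ \text{cells}.
\]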

Historically, as with many legacy codes, TRUST was optimized for CPU-only architectures. However, with future supercomputers expected to derive most of their performance from GPU-accelerated compute nodes – while only a small fraction (around 30 PFlops, or approximately 3%) comes from scalar CPU nodes – adapting to this new paradigm poses a major challenge. This shift demands significant redevelopment efforts, particularly in porting existing algorithms to GPU-friendly forms.

To address this, the development of TRUST has progressively integrated GPU-targeted libraries to accelerate key computational parts of the code. The ultimate objective is to create a hybrid CPU/GPU version of the platform that performs efficiently regardless of the compute node architecture.

In 2020, the AmgX library [15] developed by NVIDIA was tested and successfully integrated into TRUST. Subsequently, in 2022, rocALUTION [16], a sparse linear algebra library for AMD GPUs, was included in the codebase as part of a GENCI support contract during the deployment of the Adastra supercomputer at CINES [17]. While rocALUTION produced encouraging results, it underperformed compared to AmgX in several areas, motivating the exploration of alternative solutions. Current development efforts focus on ensuring compatibility with multiple architectures, including NVIDIA, AMD, and Intel GPUs.

The structure of this paper is as follows: Section 2 introduces the architecture of the TRUST platform. Section 3 presents the largest simulation performed to date using the platform. Sections 4 and 5 describe the strategy adopted for GPU integration (both on the linear solver side and on the compute kernel side). Section 6 presents the main results obtained so far, and finally Section 7 concludes the study and outlines future prospects.

2. Code architecture

TRUST is built on a robust and efficient software architecture, leveraging the strengths of the C++ programming language and object-oriented design principles. The choice of C++ is historically rooted and remains justified for several reasons:

  • it supports core object-oriented features such as inheritance, polymorphism, and encapsulation;

  • its strong type system contributes to software reliability;

  • as a compiled language, it delivers the high performance required for intensive computations;

  • being a widely adopted industry standard, C++ benefits from a vast ecosystem of development tools and libraries, thereby extending the software’s longevity and maintainability.

Modern developments within TRUST have adopted features from recent C++ standards, including templates, smart pointers, the SFINAE idiom (Substitution Failure Is Not An Error [18]), and the Standard Template Library (STL), all of which contribute to improved code readability and maintainability. The design prioritizes clarity and simplicity, with coarse-grained inheritance structures and straightforward loop constructs.

The platform is supported by tools such as Doxygen for code documentation and Jupyter notebooks for validation, offering a modern and accessible environment for development, computational experimentation, validation/verification, and physical analysis.

Figure 2 outlines the core modules of TRUST (kernel, spatial discretizations, and physics) as well as its dependencies and toolchain. The modular architecture facilitates the integration of new features, classes, and physical models. Developers can easily extend the codebase by deriving new classes from existing ones, either generically or in a way tailored to specific spatial discretizations (see Fig. 1 for mesh compatibility).

Fig. 2.

Main TRUST modules and tools.

At the lowest level, parallelism in TRUST is implemented using the MPI (Message Passing Interface) library, following the Single Program Multiple Data (SPMD) paradigm. This approach is deeply embedded in the platform’s core architecture, particularly in the design of data structures such as distributed arrays. These arrays are natively constructed to handle domain splitting and inter-process communication, enabling efficient large-scale parallel computations. By adopting MPI at the foundation level, TRUST ensures scalability across high-performance computing systems, making it suitable for both fine-grained and large-domain simulations.

To support the loading and initialization of large simulation domains, TRUST uses 64-bit integers for indexing mesh entities and associated data structures prior to domain partitioning. This design choice enables the platform to handle meshes containing billions of elements, surpassing the limitations imposed by 32-bit indexing. The use of 64-bit indices is limited to the initial loading phase (before the domain is split across MPI processes), where global indices must accommodate the full size of the unpartitioned domain. A specific C++ type, trustIdType, was introduced for indices of geometric entities that may overflow 32 bits. By contrast, once the domain is split, each processor only deals with 32-bit integers, as the sub-domain on each processor is always small enough. Consequently, all physical models, post-processing routines, and equation solving remain coded with standard integers and did not need any porting.
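A minimal sketch of this split is given below (the declarations are hypothetical; the actual trustIdType definition lives in the TRUST kernel and may differ):

#include <cstdint>
#include <vector>

// Hypothetical sketch: 64-bit alias used only for global, pre-partitioning indices.
using trustIdType = std::int64_t;

// Global element-to-node connectivity read at load time, before the domain is
// split: entry counts may exceed 2^31, hence the 64-bit index type.
std::vector<trustIdType> global_elem_to_node;

// After partitioning, each MPI rank holds a small sub-domain, so local loops,
// physics models and post-processing keep plain 32-bit ints.
void local_loop(int nb_local_elem, std::vector<double>& volume)
{
  for (int elem = 0; elem < nb_local_elem; elem++)
    volume[elem] = 0.0; // placeholder for an actual local computation
}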

Checkpoint-restart capabilities in TRUST are managed through the Portable Data Interface library (PDI) [19], which provides an abstract and flexible framework for data I/O. PDI integrates seamlessly with TRUST via a plugin-based architecture, allowing the use of high-performance backends such as HDF5. This modularity ensures adaptability to various storage systems and output formats. Importantly, PDI supports scalable data management by enabling the generation of one checkpoint file per compute node, rather than per MPI rank, significantly reducing the number of output files and improving I/O performance at scale (cluster filesystems typically do not cope well with the creation of several thousand inodes).
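The one-file-per-node behaviour itself is provided by PDI and its HDF5 plugin; purely to illustrate the underlying idea (this is not TRUST's actual I/O path), MPI ranks can be grouped by compute node with a shared-memory communicator split, letting a single rank per node write the aggregated checkpoint:

#include <mpi.h>

// Illustration: group ranks by compute node so that one rank per node
// (node_rank == 0) writes a single checkpoint file for that node.
void node_grouping_example()
{
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int node_rank;
  MPI_Comm_rank(node_comm, &node_rank);
  if (node_rank == 0)
  {
    // ... open one HDF5 file for this node and write the aggregated data ...
  }
  MPI_Comm_free(&node_comm);
}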

While TRUST can function as a standalone general-purpose simulation platform, it is also designed for extensibility. Specialized applications can be built by reusing or overriding specific modules and classes. Such applications, internally referred to as BALTIKs (Build an Application Linked to TRUST Internal Kernel), adapt TRUST to various modeling needs, including fine-scale multiphase turbulence (e.g., TrioCFD [3]), component-level simulations (e.g., the 3D module of CATHARE [20]), and even non-nuclear systems such as battery and PEM fuel cell simulations, among other CEA internal projects.

3. HPC capabilities: the largest simulation performed to date (2 billion cells) revisited

This section reviews the largest Direct Numerical Simulation (DNS) performed to date using the TRUST platform. For full details, the reader is referred to the original study in [14].

The work focused on safety assessments of hydrogen energy systems by modeling hydrogen leaks in confined, ventilated spaces. Hydrogen stored at high pressure (up to 700 bar) can form flammable mixtures with air upon accidental leakage, posing serious hazards when the hydrogen volume fraction lies between 4% and 75% [21]. Passive ventilation systems are commonly implemented to mitigate hydrogen accumulation [22], and accurately predicting hydrogen-air mixture dynamics is critical for safety evaluations [23].

The DNS approach was employed to study buoyant jet flows inside a ventilated experimental cavity of 1 m3, replicating a typical two-vent fuel cell configuration with passive ventilation [24]. This represented the first DNS at such a large scale and Richardson number (≈ 10¹), producing comprehensive 3D instantaneous and statistical data intended as a benchmark reference.

The governing equations included conservation of mass, momentum, and species, under a low Mach number approximation suitable for iso-thermal and iso-baric conditions with high density ratios [25, 26]. The Finite Difference Volume (FDV) method was employed on a staggered MAC grid, using second-order central differencing combined with a third-order QUICK scheme for species convection to ensure monotonicity. Time integration relied on a semi-implicit approach, with diffusion terms solved via a conjugate gradient method using the PETSc library [10], and pressure-velocity coupling handled by an incremental projection method with an SSOR solver.
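Schematically, and in our own notation rather than that of [14], the governing system takes the usual isothermal, low Mach number, binary-mixture form:

\begin{align}
  \partial_t \rho + \nabla\cdot(\rho \mathbf{u}) &= 0, \\
  \partial_t(\rho \mathbf{u}) + \nabla\cdot(\rho \mathbf{u}\otimes\mathbf{u})
    &= -\nabla P + \nabla\cdot\boldsymbol{\tau} + \rho\,\mathbf{g}, \\
  \partial_t(\rho Y_1) + \nabla\cdot(\rho \mathbf{u}\, Y_1)
    &= \nabla\cdot(\rho D\,\nabla Y_1),
\end{align}

where Y_1 is the hydrogen mass fraction, P the dynamic pressure, \boldsymbol{\tau} the viscous stress tensor and D the binary diffusion coefficient; the density follows from the isothermal, isobaric mixture equation of state, so the velocity field is not divergence-free despite the low Mach assumption.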

Two meshes were used, based on an estimated turbulent Kolmogorov length scale of η = 8 × 10⁻⁴ m, referred to as Mesh 1 and Mesh 2. In both cases, non-uniform unstructured hexahedral meshes were generated using the open-source SALOME platform, leveraging the Hexahedron (i, j, k) and body-fitted algorithms [11].

Mesh 1 contained 250 million cells distributed across 5376 MPI processes, with non-uniform grid spacing ranging from δ = 1 mm near vents to 4 cm in the far field. Mesh 2 was constructed by refining each cell of Mesh 1 into 8 sub-cells, yielding 2 billion cells distributed over 50047 MPI processes. This finer resolution, ranging from δ = 0.5 mm to 2 cm, enabled full resolution of turbulent structures below the Kolmogorov scale (see Fig. 3).

Fig. 3.

Iso-contours of the ratio δ/η in the mid-vertical xz-plane.

To accelerate the transient phase of the DNS, stationary field variables computed with Mesh 1 were interpolated onto Mesh 2. Since the simulation involved 50047 MPI processes, the high-performance MEDCoupling library was used in conjunction with its parallel interpolator InterpKernelDEC to achieve efficient and accurate parallel interpolation [11]. Figure 4 shows the vorticity magnitude of the interpolated velocity field in the mid-vertical xz-plane (y = 0), a few seconds after the initial condition of the refined simulation.

Fig. 4.

Instantaneous vorticity magnitude iso-contours in the mid-vertical xz-plane.

Results indicated that Mesh 2 better resolved small-scale turbulent structures near the injection zone, particularly the entrainment of ambient air by the rising hydrogen jet (see Fig. 5). This improvement is consistent with the finer spatial resolution below the Kolmogorov scale. In contrast, Mesh 1 adequately captured the larger scales in the upper plume region, where turbulence is less intense. The jet-to-plume transition zone, located near the injection point, exhibited the smallest turbulent scales driven by Rayleigh–Taylor instabilities, further justifying the need for Mesh 2 to accurately represent the flow physics.

Fig. 5.

Iso-contours of velocity magnitude near the release zone, showing enhanced resolution of turbulent structures with Mesh 2 (right).

Quantitatively, the DNS results were also compared to experimental measurements using a vertical profile of the time-averaged hydrogen volume fraction ⟨X1⟩t, sampled at the same locations as the 15 mini-catharometers used in the GAMELAN experiment (see Fig. 6). The solid black line represents the DNS, while symbols denote experimental data. The agreement is excellent, with both DNS and experiment capturing a bi-layer distribution, the interface height between layers, and a maximum hydrogen concentration of approximately 1.6% in the upper homogeneous layer. The absolute discrepancy is within 0.1%, attributable to experimental measurement uncertainties or numerical discretization errors. These results mark a significant improvement over previous RANS and LES models, which failed to reproduce such a stratified configuration under similar release conditions [27, 28].

Fig. 6.

Vertical profile of time-averaged hydrogen volume fraction near the wall. DNS: solid line; experiment: symbols.

It is worth emphasizing that this large-scale reference DNS was made possible by the high-performance computing (HPC) capabilities of TRUST, including efficient domain splitting with METIS, optimized solvers with PETSc, and parallel I/O using HDF5 and CGNS. The simulation exhibited excellent scalability on tens of thousands of MPI processes, establishing TRUST as a state-of-the-art platform for massive CFD computations. The work was supported by HPC resources at TGCC IRENE-ROME under GENCI allocation A0092A12033.

4. Porting linear solvers to GPU

The total computational cost of a TRUST simulation can be broadly divided into two main components. The first – and most computationally intensive – involves solving one or more sparse linear systems of the form Ax = b, which typically arises when computing the pressure field in incompressible Navier–Stokes simulations or during implicit treatments of certain terms, such as diffusion. The second component consists of evaluating various physical quantities, including fluxes across element faces or coefficients used in implicit matrix assembly.

Among these, the resolution of sparse linear systems dominates the overall cost, often accounting for up to 80% of total simulation time [14]. As a result, this step is a primary target for performance optimization efforts within the platform.

When running on CPU only, TRUST traditionally uses the PETSc sparse linear algebra library [10]. Although the PETSc project is making an increasing effort to port its library to GPU, this support was not mature enough at the time of writing, and we did not investigate this possibility further. In recent years, several sparse linear solvers adapted to GPUs have been proposed; among these, we tested three in this paper: AmgX [15], rocALUTION [16], and AMGCL [29].

For the second component, OpenMP [30] was formerly used in the TRUST code to offload some of the most intensive compute kernels to GPU, but an ongoing effort is moving the code to the Kokkos library [31], which allows performance-portable code to be written for various types of architectures.

4.1. AmgX

AmgX is a high-performance, open-source library developed by NVIDIA to accelerate the solution of large sparse linear systems on their GPU architectures [15]. It implements a suite of iterative solvers and Algebraic Multigrid (AMG) preconditioners that are optimized for execution on NVIDIA GPUs. Within TRUST, the most commonly used iterative solvers provided by AmgX include Conjugate Gradient (CG), Generalized Minimal Residual (GMRES), and BiConjugate Gradient STABilized (BiCGSTAB), which cover the resolution of a wide range of symmetric and non-symmetric linear systems encountered in fluid dynamics applications.

To enhance convergence efficiency and parallel scalability, AmgX also offers a variety of preconditioners, critical components in sparse linear solver performance. These include the Classical AMG (C-AMG) based on the Ruge–Stüben algorithm and the Unsmoothed Aggregation AMG, both of which are suitable for different problem structures and hardware constraints. Additionally, AmgX supports multiple relaxation (smoother) techniques such as Jacobi, Gauss–Seidel, Successive Over-Relaxation (SOR), Incomplete LU (ILU), and Chebyshev polynomial smoothers.

Integration of AmgX into TRUST is managed via the Solv_AMGX class, which is part of the platform’s solver class hierarchy. As illustrated in Figure 7, this class currently inherits from Solv_Petsc, which encapsulates interactions with the PETSc library. This architectural choice allows TRUST to maintain its native matrix storage in PETSc’s mataij format, while leveraging AmgX solvers through a data conversion layer. The conversion is handled by the AmgXWrapper project [32], which facilitates translation of PETSc data structures into the format required by AmgX.

Fig. 7.

Implementation of the AmgX solver as a TRUST solver.

Below is the typical implementation of a call to the AmgX solver in TRUST:


int Solv_AMGX::solve(ArrOfDouble& residu)
{
  // Transfer input data to device if necessary
  mapToDevice(rhs_);
  // Mark output data as being computed on the device
  computeOnTheDevice(lhs_);
  statistics().begin_count(gpu_library_counter_);
  // Invoke the AmgX library - solving Ax=b:
  SolveurAmgX_.solve(addrOnDevice(lhs_), addrOnDevice(rhs_), nRowsLocal, threshold_);
  statistics().end_count(gpu_library_counter_);
  Cout << "[AmgX] Time to solve system on GPU: " << statistics().last_time(gpu_library_counter_) << finl;
  // Compute and return the number of iterations:
  int nbiter = -1;
  SolveurAmgX_.getIters(nbiter);
  SolveurAmgX_.getResidual(0, residu(0));
  if (nbiter > 0)
    SolveurAmgX_.getResidual(nbiter - 1, residu(nbiter));
  return nbiter;
}

The programming model adopted in TRUST follows a commonly used parallel strategy: each MPI rank (process) is assigned exactly one GPU. Following this strategy, the local matrix on each MPI process is transferred to its corresponding GPU.
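A common way to realize this one-rank-one-GPU binding (a generic CUDA sketch, not the exact TRUST/AmgX initialization code) is to pick the device from the node-local rank:

#include <mpi.h>
#include <cuda_runtime.h>

// Illustration: bind each MPI rank to one GPU of its node by using the
// node-local rank as CUDA device index.
void bind_rank_to_gpu()
{
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int local_rank = 0, n_devices = 0;
  MPI_Comm_rank(node_comm, &local_rank);
  cudaGetDeviceCount(&n_devices);
  cudaSetDevice(local_rank % n_devices); // one rank drives one GPU
  MPI_Comm_free(&node_comm);
}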

4.2. rocALUTION

rocALUTION [16] is an open-source linear algebra library developed as part of AMD’s ROCm ecosystem, specifically optimized for AMD GPU architectures. The library provides a wide range of numerical functionalities, including sparse matrix manipulations, iterative and direct solvers, and various preconditioning techniques. Designed to operate efficiently on heterogeneous systems, rocALUTION supports both host (CPU) and device (GPU) execution backends, and can dynamically shift computations between them, making it suitable for hybrid CPU–GPU workflows. Its clean, modular design and comprehensive documentation facilitate integration into large-scale simulation frameworks.

Within TRUST, the integration of rocALUTION is encapsulated in the Solv_rocALUTION class, which inherits from the generic Solv_Externe base class (see again Fig. 7). This base class provides a shared infrastructure for interfacing with external solver libraries (namely PETSc, AmgX, and rocALUTION) and abstracts common operations required for external coupling. These include the consistent management of matrix and vector indexing schemes (supporting both local and global numbering), the identification and treatment of shared entities across MPI ranks (such as joint vertices and faces), and the conversion of internal data structures into formats suitable for external libraries.

In the case of rocALUTION, this conversion process specifically targets the Compressed Sparse Row (CSR) format, referred to internally as Morse format in TRUST. The interface ensures that distributed matrices and vectors in TRUST are properly mapped to rocALUTION’s expected data layout, while maintaining synchronization across subdomains. Although the GPU support in rocALUTION is still evolving compared to more mature libraries like AmgX, its compatibility with AMD hardware makes it a valuable option for users targeting HIP-compatible platforms or porting simulations to non-NVIDIA HPC environments.
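As a reminder of what this conversion produces (a generic CSR layout; the member names of TRUST's Morse matrices are not shown here), a sparse matrix is stored as three flat arrays, on which the external solver ultimately performs its matrix-vector products:

#include <vector>

// Generic CSR ("Morse") storage: for row i, the nonzeros are
// values[ptr[i] .. ptr[i+1]-1], located in columns col[ptr[i] .. ptr[i+1]-1].
struct CsrMatrix
{
  std::vector<int>    ptr;    // size n_rows + 1, row offsets
  std::vector<int>    col;    // size nnz, column indices
  std::vector<double> values; // size nnz, nonzero coefficients
};

// y = A * x, the basic kernel every external solver applies to this layout.
void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y)
{
  const int n_rows = static_cast<int>(A.ptr.size()) - 1;
  for (int i = 0; i < n_rows; i++)
  {
    double sum = 0.0;
    for (int k = A.ptr[i]; k < A.ptr[i + 1]; k++)
      sum += A.values[k] * x[A.col[k]];
    y[i] = sum;
  }
}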

The inclusion of rocALUTION in the TRUST solver ecosystem reinforces the platform’s architecture-neutral design philosophy, enabling users to take advantage of diverse hardware backends without significant changes to the simulation code or data handling logic.

4.3. AMGCL

AMGCL (Algebraic MultiGrid Computation Library) [29] is a lightweight, header-only C++ library designed to provide scalable and efficient algebraic multigrid (AMG) solvers. It supports both CPU and GPU backends and is intended for use in HPC applications. AMGCL is highly modular and supports various linear algebra backends, including CPU-based libraries such as Intel MKL, Eigen, and Boost, as well as GPU-accelerated options through CUDA or OpenCL. This versatility makes AMGCL a portable and architecture-agnostic solution, capable of running efficiently on both NVIDIA and AMD hardware.

One of AMGCL’s key strengths lies in its portability. Unlike many GPU solvers that rely solely on vendor-specific technologies, AMGCL can be built with the OpenCL standard, enabling cross-platform compatibility. This allows users to target a wide range of hardware configurations, including NVIDIA GPUs (via CUDA or OpenCL), AMD GPUs (via OpenCL), and traditional CPU architectures – making it particularly appealing for environments that require flexibility or in transition between architectures.

In the context of this work, AMGCL was evaluated primarily from the standpoint of raw solving performance. Unlike PETSc, AmgX, or rocALUTION, no dedicated interface class was implemented within the TRUST codebase for AMGCL. Instead, the solver was benchmarked externally using a shared matrix format. Specifically, matrices were exported from TRUST in a standard binary format compatible with the Matrix Market specification [33], which is also used in the AMGCL project’s examples. This approach allowed for a quick and fair comparison of solver performance without the overhead of integration.
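For reference, a standalone benchmark of this kind can be set up with AMGCL in a few lines; the sketch below follows the library's documented examples and only approximately mirrors the configuration of Table 3 (CG with smoothed-aggregation AMG and a SPAI(0) smoother, on the built-in CPU backend; switching to a GPU backend only changes the Backend typedef):

#include <cstddef>
#include <tuple>
#include <vector>

#include <amgcl/backend/builtin.hpp>
#include <amgcl/adapter/crs_tuple.hpp>
#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/coarsening/smoothed_aggregation.hpp>
#include <amgcl/relaxation/spai0.hpp>
#include <amgcl/solver/cg.hpp>

// Sketch: solve a CSR system (n, ptr, col, val) exported from TRUST.
int solve_csr(int n, std::vector<int>& ptr, std::vector<int>& col,
              std::vector<double>& val, std::vector<double>& rhs,
              std::vector<double>& x)
{
  typedef amgcl::backend::builtin<double> Backend;
  typedef amgcl::make_solver<
      amgcl::amg<Backend,
                 amgcl::coarsening::smoothed_aggregation,
                 amgcl::relaxation::spai0>,
      amgcl::solver::cg<Backend>> Solver;

  Solver solve(std::tie(n, ptr, col, val)); // setup phase: AMG hierarchy built here

  std::size_t iters = 0;
  double error = 0.0;
  std::tie(iters, error) = solve(rhs, x);   // solve phase
  return static_cast<int>(iters);
}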

While AMGCL is not yet fully integrated into the TRUST solver hierarchy, the preliminary results suggest promising performance and portability characteristics. As such, it represents a viable candidate for future inclusion, particularly in scenarios requiring cross-platform GPU support or rapid prototyping of multigrid-based preconditioners.

5. Porting computation kernels to GPU

Beyond the resolution of sparse linear systems, a significant portion of the computational workload in TRUST arises from the evaluation of physical models and numerical operators. These tasks include, but are not limited to, the computation of fluxes across element faces, the assembly of residuals and source terms, interpolation operations, and the evaluation of correlations. While individually less expensive than linear solvers, these compute kernels are invoked repeatedly across all mesh elements and time steps, and therefore contribute substantially to the overall runtime. Optimizing these routines is thus essential to achieve end-to-end acceleration of large-scale simulations. The following section outlines the current strategy adopted in TRUST to port these kernels to heterogeneous computing platforms.

5.1. Historical approach with OpenMP

As mentioned previously, the computation kernels responsible for evaluating various physical quantities or coefficients represent the second most computationally intensive component in a typical TRUST simulation, after the solution of sparse linear systems.

Unlike the solvers, which are invoked at well-defined locations in the codebase, these kernels are scattered across numerous parts of the code and embedded within loops over geometrical entities (e.g., faces, elements, and space dimensions) to compute specific quantities such as diffusion fluxes. Their widespread distribution makes the migration of these routines to GPU architectures significantly more complex and time-consuming.

Historically, the initial strategy to accelerate these kernels relied on the use of OpenMP [30], particularly through the #pragma omp target directive to offload computations to GPUs. This approach was actively pursued for several years. However, it eventually proved to be limited in terms of maintainability, flexibility, and portability.

Below we present a concrete example of a compute kernel from the divergence operator applied to the velocity field on an unstructured mesh using the VEF (Finite Element Volume) spatial discretization. The two code listings illustrate the progression from the original CPU implementation to its OpenMP-parallelized variant; this will be compared later on to the current Kokkos-based version.


for (int elem = 0; elem < nb_elem; elem++) {
    double pscf = 0;
    for (int indice = 0; indice < nfe; indice++) {
        const int face = elem_faces(elem, indice);
        const int signe = elem == face_voisins(face, 0) ? 1 : -1;
        for (int comp = 0; comp < dim; comp++)
            pscf += signe * vit(face, comp) * face_norm(face, comp);
    }
    div(elem, 0) += pscf;
}

The OpenMP kernel presented below forces us to explicitly compute the index expansion when accessing multi-column arrays, which has proven to be very error-prone and hard to debug, as no bounds checking is performed:


#pragma omp target teams distribute parallel for if (computeOnDevice)
for (int elem = 0; elem < nb_elem; elem++) {
    double pscf = 0;
    for (int indice = 0; indice < nfe; indice++) {
        const int face = elem_faces_addr[elem * nfe + indice];
        const int signe = (elem == face_voisins_addr[face * 2]) ? 1 : -1;
        for (int comp = 0; comp < dim; comp++)
            pscf += signe * vit_a[face * dim + comp] * face_norm_a[face * dim + comp];
    }
    div_addr[elem] += pscf;
}

5.2. Kokkos usage

Given the OpenMP limitations mentioned previously, a strategic shift was made towards adopting the Kokkos library [31], developed within the U.S. Exascale Computing Project (ECP). Kokkos enables the development of performance-portable C++ kernels that can execute on both CPUs and GPUs without requiring architecture-specific code rewriting. It provides an abstraction layer over heterogeneous architectures (including x86, ARM, and accelerators from NVIDIA, AMD, and Intel) while optimizing memory access patterns and data layouts.

Moreover, Kokkos supports execution on machines equipped with multiple types of accelerators and facilitates the management of data across different memory hierarchies (e.g., HBM, DRAM, NVRAM). These features make Kokkos a sustainable and scalable choice for porting TRUST’s compute kernels.

The implementation of compute kernels using Kokkos is greatly facilitated by its support for modern C++ constructs, particularly lambda functions. This enables a kernel syntax that closely resembles the original CPU version, thereby easing the transition to GPU architectures and improving code readability and maintainability. An additional advantage of Kokkos is its built-in capability to detect out-of-bounds memory access, which contributes to safer and more robust parallel programming practices. A representative example is shown below, where the same divergence operator is applied over mesh elements:


auto kern_ajouter = KOKKOS_LAMBDA(int elem) {
    double pscf = 0;
    for (int indice = 0; indice < nfe; indice++) {
        const int face = elem_faces_v(elem, indice);
        const int signe = elem == face_voisins_v(face, 0) ? 1 : -1;
        for (int comp = 0; comp < dim; comp++)
            pscf += signe * vit_v(face, comp) * face_norm_v(face, comp);
    }
    div_v(elem, 0) += pscf;
};
Kokkos::parallel_for("[KOKKOS] Op_Div", nb_elem, kern_ajouter);

Memory allocation for the data structures accessed in such kernels currently relies solely on Kokkos mechanisms: the memory is allocated via the kokkos_malloc method, and the resulting pointer is passed to an unmanaged Kokkos view. This comes from the historical setup in which memory allocation was performed by OpenMP; work in progress aims at directly using the native allocation mechanism of standard (managed) Kokkos views.
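Concretely, the current pattern looks like the following simplified sketch (the actual TRUST array wrappers add more bookkeeping):

#include <cstddef>
#include <Kokkos_Core.hpp>

// Sketch: allocate device memory with kokkos_malloc and wrap it in an
// unmanaged view; the view does not own the memory.
void unmanaged_view_example(int nb_faces, int dim)
{
  using MemSpace = Kokkos::DefaultExecutionSpace::memory_space;

  double* raw = static_cast<double*>(Kokkos::kokkos_malloc<MemSpace>(
      static_cast<std::size_t>(nb_faces) * dim * sizeof(double)));

  Kokkos::View<double**, Kokkos::LayoutRight, Kokkos::DefaultExecutionSpace,
               Kokkos::MemoryTraits<Kokkos::Unmanaged>>
      vit_v(raw, nb_faces, dim); // same (face, comp) indexing as in the kernel above

  // ... launch kernels using vit_v(face, comp) ...

  Kokkos::kokkos_free<MemSpace>(raw); // explicit free: the view is unmanaged
}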

Another critical consideration concerns the internal memory layout of multi-dimensional arrays, which can significantly affect performance, particularly on GPUs. Kokkos provides two primary layout options for its View objects: LayoutLeft (where the left-most index varies fastest, i.e., column-major order) and LayoutRight (right-most index varies fastest, i.e., row-major order). It is well documented that LayoutLeft typically results in better memory coalescing and data locality on GPUs, which can translate into substantial performance gains [31].

However, TRUST, like many legacy HPC codes developed for CPU execution, historically adopted LayoutRight, in accordance with the cache optimization strategies typical of those architectures. Migrating to LayoutLeft requires restructuring of the codebase, including explicit data transpositions or reallocation of arrays. Work on this possibility is still in progress, and the topic needs to be addressed on a case-by-case basis. Indeed, in many cases within TRUST, the additional array dimensions (e.g., the number of components per variable) are small, often fewer than ten. This raises the question of whether the performance penalty from using a suboptimal layout is significant enough to warrant the added complexity of data restructuring. Ongoing profiling and benchmarking efforts are being conducted to evaluate this trade-off more precisely and to guide future layout decisions.
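The difference is only which index varies fastest in memory; in Kokkos it is selected at the type level, as in the illustrative declarations below (not taken from the TRUST sources):

#include <Kokkos_Core.hpp>

void layout_example(int nb_faces, int dim)
{
  // LayoutRight (row-major): the components vit_r(face, 0..dim-1) are
  // contiguous, which suits CPU caches; this is the historical TRUST choice.
  Kokkos::View<double**, Kokkos::LayoutRight> vit_r("vit_r", nb_faces, dim);

  // LayoutLeft (column-major): vit_l(0..nb_faces-1, comp) are contiguous, so
  // consecutive GPU threads reading consecutive faces get coalesced accesses.
  Kokkos::View<double**, Kokkos::LayoutLeft> vit_l("vit_l", nb_faces, dim);
}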

6. Results and discussion

6.1. Solver results

This section presents and analyzes the performance of various GPU-accelerated linear solvers integrated within the TRUST platform. These solvers are benchmarked against the widely used PETSc CPU-based implementation, using a set of test cases with increasing problem sizes in terms of the number of pressure unknowns (from 2.6 million to 166 million). All tests were performed on representative high-performance computing platforms with distinct architectures: the NVIDIA-based Topaze supercomputer (CCRT [34]) for AmgX and AMGCL, and the AMD-based Adastra cluster [17] for rocALUTION.

The aim is to assess both raw performance (time per solve and per iteration) and scalability (evolution of performance with increasing problem size and node count), as well as the robustness of convergence across configurations. This comparison sheds light on the current maturity of each library and highlights challenges still to overcome.

6.1.1. AmgX

To evaluate the integration and performance of AmgX in TRUST, we ran benchmark tests on the CCRT Topaze supercomputer [34], comparing the solver’s GPU performance against PETSc on CPU for identical problem sizes and solver configurations. Each test solves a pressure matrix extracted from a VEF discretization, with system sizes of 2.59, 20.7, and 165.9 million unknowns, using an equivalent number of compute nodes on the scalar (CPU) and accelerated (GPU) partitions. Results are presented in Table 1.

Table 1.

AmgX performance on NVidia GPUs. Tps refers to the time taken to solve a linear system once the setup is done, Its refers to iterations to solve the linear system.

At small scale (2.6M unknowns), AmgX demonstrates a dramatic improvement, reaching solution times more than 13× faster than PETSc (20ms vs. 270ms). As the problem scales, performance remains significant, but the relative speedup diminishes. On the largest case, AmgX is approximately 3× faster (112ms vs. 362ms), primarily due to a degradation in convergence rate: the number of solver iterations increases significantly (from 5 to 18), while PETSc maintains a stable iteration count (5–6), indicating better scalability of its multigrid setup.

The solver configurations used were intentionally identical (CG solver with classical AMG preconditioning and relative convergence tolerance of 5e-4) to isolate the impact of hardware and implementation differences. However, the smoother differs: Jacobi in AmgX vs. Chebyshev in PETSc. This difference may contribute to AmgX’s increasing iteration count at large scale, which reflects a key challenge for GPU solvers: sustaining scalable convergence through parallelizable, efficient preconditioners.

Despite this, AmgX currently stands as the most performant GPU solver integrated in TRUST. It benefits from a mature, actively maintained codebase, and its performance profile suggests that future enhancements, especially in preconditioner scalability, could further solidify its position.

6.1.2. rocALUTION

Initial evaluations of the rocALUTION solver were conducted on the AMD-powered Adastra supercomputer [17], the only accessible platform supporting ROCm (AMD’s equivalent to CUDA). The same sequence of test cases was used, enabling direct comparison to both PETSc and AmgX. The solver was configured with standard AMG preconditioners available in rocALUTION, although tuning was required and convergence issues were observed with several options. Results are presented in Table 2.

Table 2.

rocALUTION performance on AMD GPU.

While rocALUTION achieves very low per-iteration times (2.4–2.7 ms), the number of iterations required to converge increases dramatically with problem size (up to 249 for the largest case), resulting in overall solve times significantly longer than with PETSc or AmgX. Notably, this iteration count is already high at small scale (49 vs. 5 for PETSc), which suggests difficulties in finding suitable scalable preconditioners.

According to the project's documentation and further discussions with support, most rocALUTION AMG implementations (C-AMG, SA-AMG, UA-AMG) are inherently local and sequential, constructed independently on each MPI subdomain. Only PW-AMG (Pairwise AMG) is parallel, but it remains unstable (recent bug reports on MPI crashes) and too costly in practice to be competitive.

As such, despite a promising per-iteration cost and a successful integration into TRUST, rocALUTION's convergence behavior on large-scale problems remains a major bottleneck. Significant improvements in preconditioner parallelism would be required for the library to match or exceed its NVIDIA counterpart. This is still being worked on, but the solver might be dropped from TRUST in future releases and replaced, for example, by the GPU version of the PETSc/HYPRE preconditioner.

6.1.3. AMGCL

Unlike the other solvers, AMGCL has not yet been integrated into TRUST directly. Instead, we evaluated its performance using external matrix files in Matrix Market format [33], exported from the TRUST platform. The test matrix corresponds to the pressure system on a 2.6 million unknowns tetrahedral mesh, discretized using the VEF scheme.

The solver used a CG method with either C-AMG or SA-AMG preconditioning. The configuration also enabled matrix renumbering to reduce bandwidth, improving memory locality, which is critical for memory-bound methods on GPU architectures. The smoothers were Jacobi for AmgX and SPAI (SParse Approximate Inverse) for AMGCL. The coarsest-grid solver was set to a direct LU (see Tab. 3).

Table 3.

AMGCL performance on a TRUST matrix – NVidia GPU.

The results are encouraging: AMGCL achieves competitive or even superior solve times compared to AmgX (down to 45 ms vs. 60 ms), despite its setup time being much higher due to CPU-based preconditioner construction. For simulations in which the preconditioner is built once and reused over many time steps, this overhead is amortized and becomes negligible. However, scenarios involving Jacobian updates at each time step (e.g., Low Mach Number and multiphase models) would suffer significantly from this cost, and alternative strategies would be required.

AMGCL demonstrates excellent potential for TRUST-like matrices, and further tests with multi-node GPU configurations are needed to assess scalability. The library’s confirmed compatibility with AMD GPUs (pending further validation) and the strong documentation and community support make it a promising candidate for future integration.

6.2. OpenMP – Kokkos: computation kernel comparison

In addition to evaluating solver performance, we also benchmarked the execution of key compute kernels involved in the resolution of the Navier–Stokes equations within the TRUST platform. Specifically, we compared the performance of kernels originally written using the OpenMP programming model (historically used for GPU execution) with their newer counterparts implemented in Kokkos, considering both the initial implementation (2024) and the most recent version (2025).

This comparison was done on a representative test case, focusing on the time required to compute fluxes associated with the four key differential operators: divergence, gradient, diffusion, and convection. These operators are representative of the broader stencil and compute patterns in TRUST; results are presented in Table 4.

Table 4.

Performance comparison between OpenMP and Kokkos implementations.

The results reveal a non-uniform performance impact of the Kokkos migration, which can be attributed to differences in the computational characteristics of each operator as well as the maturity of the Kokkos implementation in the code.

  • Divergence and Gradient kernels exhibit near parity or modest improvements with Kokkos. The marginal performance gain of 5% for the gradient operator indicates that Kokkos can achieve similar efficiency to OpenMP for memory-bound, low arithmetic-intensity kernels, with no loss in performance portability.

  • Diffusion exhibits a significant performance gain with Kokkos (+28% in 2024, +50% in 2025). This improvement likely stems from more efficient memory access patterns and GPU-tailored loop structures introduced during the port, which Kokkos is able to exploit better than OpenMP on traditional CPU architectures. These benefits appear to be amplified in the 2025 version due to additional architectural improvements and code optimization.

  • Convection, in contrast, initially performed worse under Kokkos in 2024, with a 33% increase in kernel execution time. However, this kernel underwent a substantial rewrite in 2025, with enhancements such as improved data locality and the use of faster storage (e.g., registers rather than global memory for static arrays). These changes led to a 65% performance gain relative to the original OpenMP version. It is likely that similar optimizations would have also benefited the OpenMP implementation.

We emphasize that the current strategy in TRUST relies exclusively on Kokkos, with OpenMP pragmas no longer in use. All Kokkos kernels now achieve speedups ranging from 30× to 100× compared to single-core CPU performance, depending on the GPU used. These gains are strongly correlated with the available memory bandwidth: the higher the bandwidth, the greater the observed acceleration. Overall, the transition to Kokkos and GPU execution delivers a substantial and consistently positive impact at the application level.

6.3. Global performance and scaling results

6.3.1. Chosen application

To assess the combined benefits of GPU-based solvers and Kokkos-accelerated compute kernels in TRUST, a representative physical test case was selected. The scenario involves the injection of a turbulent jet of hot water into a 3D cavity of volume 1 m3, initially filled with a colder ambient fluid.

The flow is characterized by a Reynolds number of approximately 4000 and a temperature difference ranging from 20°C to 60°C. Given the small variation in fluid density, the Boussinesq approximation is applied.

The conservation equations for momentum and energy are solved on a tetrahedral mesh of about 2.5 million cells, generated using the open-source SALOME platform. The simulation employs the Finite Element Volume (VEF) method with an upwind discretization for convective terms, and a semi-implicit time integration where diffusion terms are treated implicitly. Turbulence is modeled by a Large Eddy Simulation (LES) approach, leveraging the turbulence models provided in TrioCFD, the main open-source application within the TRUST platform. Figure 8 depicts an instantaneous distribution of the temperature field in the vicinity of the injection, in a 2D plane section located in the middle of the cavity.

Fig. 8.

Instantaneous temperature iso-contours in the injection vicinity depicting the turbulent jet spreading in a mid vertical 2D section.

The GPU-accelerated components of the code include the pressure solver (a conjugate gradient algorithm with an algebraic multigrid preconditioner), the physical models (Boussinesq model, turbulence), the semi-implicit time scheme, and the operators written using Kokkos lambdas.

The comparison highlights several trends (see Fig. 9):

  • The migration from a Rome to a Milan CPU node provides a speedup of approximately 1.8× due to architectural improvements and an increased core count.

  • Moving to GPU nodes, even the older V100 GPU offers a 2.5× speedup over the Milan node and a 4.4× gain over Rome.

  • More recent GPU models such as A100 and H100 improve this performance even further, achieving speedups of 4.3× and 6.7× respectively over the Milan CPU node.

Fig. 9.

Overall performance comparison between a CPU compute node (AMD Milan) and various NVIDIA GPU nodes (V100, A100, H100). Ordinates: arbitrary time units (time per resolution timestep).

Moreover, profiling of the GPU execution (see Fig. 10) shows that more than 96% of the runtime is effectively spent inside GPU kernels, while memory transfers between device and host (H2D, D2H) and remaining CPU operations account for less than 4%. This confirms that the current implementation effectively exploits the GPU architecture, minimizing data movement overheads.

Fig. 10.

TRUST speedup on GPU nodes vs CPU nodes (left) and execution time distribution on GPU (bottom right).

6.3.2. Strong scaling

Multi-GPU simulations in TRUST have been tested on several large-scale GPU architectures, including NVIDIA A100 (Topaze cluster [34]), AMD MI250X and AMD MI300A (Adastra cluster [17]), with node counts ranging from 32 to 128. These configurations were run using Kokkos backends for CUDA (NVIDIA) and HIP (AMD), along with GPU-Aware MPI support and algebraic multigrid (AMG) preconditioners–AmgX for NVIDIA GPUs and Hypre for AMD [35].

Strong-scaling results obtained on the A100 platform indicate good scalability for compute kernels written with Kokkos. The physical setup is an LES VEF TrioCFD simulation (destabilization of a turbulent flow by exponential heating) in a 3D channel (355 million tetrahedra). The linear solver component remains a clear bottleneck as the number of GPUs increases. This is illustrated in Figure 11, where time breakdowns show that solver time dominates the execution time beyond 64 A100 GPUs. The limited performance of the solvers can be attributed mainly to the negligible benefit (or even performance degradation) observed when the GPU-aware MPI feature is enabled.

Fig. 11.

Multi GPU strong scaling.

Additional tuning efforts, particularly on the Hypre library, have shown that memory usage and solve time can be significantly improved by adjusting a key parameter of the multigrid preconditioner: the strength threshold (see Fig. 12). The performance of AMD MI300A nodes appears competitive with that of the A100 GPUs, highlighting their promise for future deployments. Finally, a roofline analysis conducted with Nsight Compute [36] reveals that the Kokkos kernels currently reach approximately 30% of the FP64 peak performance, suggesting that there is still headroom for further optimization.
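For reference, the strength threshold is exposed directly by Hypre's BoomerAMG interface; the snippet below is a minimal sketch (not the TRUST wrapper code), and the value 0.5 is only an example of the kind of tuning discussed here:

#include "HYPRE.h"
#include "HYPRE_parcsr_ls.h"

// Sketch: create a BoomerAMG preconditioner and tune its strength threshold,
// the parameter whose effect on memory and solve time is shown in Fig. 12.
void setup_boomeramg(HYPRE_Solver* precond)
{
  HYPRE_BoomerAMGCreate(precond);
  HYPRE_BoomerAMGSetStrongThreshold(*precond, 0.5); // example value for 3D problems
  HYPRE_BoomerAMGSetMaxIter(*precond, 1);           // one V-cycle per application
  HYPRE_BoomerAMGSetTol(*precond, 0.0);             // act as a preconditioner, not a solver
}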

Fig. 12.

Tuning of Hypre Algebraic Multi-Grid preconditioner (BoomerAMG).

6.3.3. Weak scaling

A weak scaling study was conducted, on the same physical problem presented in Section 6.3.2, to evaluate the parallel performance of TRUST on multi-GPU nodes. The test case involved a fluid dynamics simulation whose size scaled proportionally with the number of computing nodes, ensuring a constant computational load per node (5 million tetrahedra per CPU/GPU node). The simulations were executed on two distinct architectures: the Adastra GPU partition (each node equipped with 8 AMD MI250X GPUs) and the Joliot-Curie CPU partition (with 128 AMD Rome cores per node).

As shown in Figure 13, GPU-based execution on Adastra exhibits significantly higher throughput (measured in millions of degrees of freedom per second, MDOF/s) than CPU-based runs, with an observed speedup factor reaching 3.6× on 4 GPUs (half a node) and stabilizing around 2× at scale (2048 GPUs, i.e., 256 nodes). While ideal scaling (dashed lines) would maintain linear growth in performance, the actual GPU curve deviates increasingly at large node counts. This reduction in efficiency is attributed to two main factors: (1) the lower convergence rate of the algebraic multigrid preconditioners (AmgX or Hypre) on GPUs compared to their CPU counterparts; and (2) the limited benefit from GPU-aware MPI communication, which has not proven effective with these solvers.

Fig. 13.

Multi GPU weak scaling.

However, the GPU implementation remains favorable at scale, achieving substantial throughput even for the largest simulations tested.

7. Conclusion and prospects

This work presented an overview of the TRUST open-source CFD platform, highlighting its HPC capabilities, architecture, and recent advances toward hybrid CPU/GPU computing. Thanks to its modular design, object-oriented core, and proven robustness, TRUST has demonstrated excellent scalability on modern HPC systems.

A concrete example was presented with the largest DNS simulation performed to date with TRUST, involving 2 billion cells across 50,000 MPI processes. Conducted in the context of hydrogen safety research, this simulation confirmed the platform's ability to efficiently resolve complex physics and provide a reference 3D solution for this problem.

Looking forward, the integration of GPU-accelerated solvers (AmgX, rocALUTION and AMGCL) and the porting of critical compute kernels (using Kokkos) marks a critical step toward adapting TRUST for emerging heterogeneous architectures. The initial results are promising, and ongoing efforts aim to further generalize and optimize the platform across various GPU vendors.

The overall performance results on a complete physical application show the great potential of porting the application to GPU acceleration, with speedup factors reaching up to 6.7× on recent hardware. In terms of scalability, the first results are encouraging, but work still needs to be done to improve solver scalability on multiple nodes and to get closer to the ideal scaling curve.

Future work will focus on broadening GPU support across all numerical kernels and discretizations, refining portability strategies, and extending the platform's capabilities to new application domains. Attention will also be paid to improving the performance of multi-GPU simulations, especially by exploiting direct GPU-to-GPU communication between nodes (without passing through the CPU).

Acknowledgments

The authors would like to express their sincere gratitude to the entire TRUST development and support team for their valuable support and continuous efforts in maintaining and evolving the platform.

Funding

This research was supported by an internal project funded by the French Alternative Energies and Atomic Energy Commission (CEA). The computations were performed using resources allocated under this internal initiative.

Conflicts of interest

The authors declare that they have no conflict of interest.

Data availability statement

All the simulation setups, input data, and test cases used in this work are available on the official TRUST GitHub repository: https://github.com/cea-trust-platform/TRUST. Please refer to the “tests” directory for detailed examples and configuration files corresponding to this study.

Author contribution statement

E. Saikali: Writing – Original Draft Preparation, Conceptualization, Methodology, Resources. A. Bruneton: Writing – Original Draft Preparation, Conceptualization, Methodology, Resources. P. Ledac: Review, Formal Analysis, Conceptualization, Resources. R. Bourgeois: Review, Formal Analysis, Conceptualization, Resources.

References

  1. TRUST, CEA-DES open-source CFD platform, https://cea-trust-platform.github.io/
  2. C. Calvin, O. Cueto, P. Emonot, ESAIM: Modélisation mathématique et analyse numérique 36, 907 (2002)
  3. P.E. Angeli, U. Bieder, G. Fauchet, Overview of the TrioCFD code: Main features, V&V procedures and typical applications to nuclear engineering, in NURETH 16 - 16th International Topical Meeting on Nuclear Reactor Thermalhydraulics (2015)
  4. E. Saikali, A. Bruneton, P. Ledac, Performances of the CFD open-source HPC platform TRUST on GPUs, EPJ Web Conf. 302, 03004 (2024)
  5. C. Reiss, A. Gerschenfeld, E. Saikali, Y. Gorsse, A. Burlot, Presenting the multi-phase solver implemented in the open source TrioCFD code based on the TRUST HPC platform, EPJ Web Conf. 302, 03001 (2024)
  6. U. Bieder, E. Graffard, Nucl. Eng. Des. 238, 671 (2008)
  7. E. Saikali, G. Bernard-Michel, A. Sergent, C. Tenaud, R. Salem, Int. J. Hydrogen Energy 44, 8856 (2019)
  8. E. Saikali, A. Sergent, Y. Wang, P. Le Quéré, G. Bernard-Michel, C. Tenaud, Int. J. Heat Mass Transfer 163, 120470 (2020)
  9. METIS, Serial Graph Partitioning and Fill-reducing Matrix Ordering, https://github.com/KarypisLab/METIS
  10. S. Balay et al., PETSc Web page, http://www.mcs.anl.gov/petsc (2015)
  11. SALOME, Open-source platform, https://www.salome-platform.org/news
  12. The HDF Group, Hierarchical Data Format, version 5, https://github.com/HDFGroup/hdf5
  13. CGNS, CFD General Notation System, https://cgns.github.io/
  14. E. Saikali, P. Ledac, A. Bruneton, A. Khizar, C. Bourcier, G. Bernard-Michel, E. Adam, D. Houssin-Agbomson, Numerical modeling of a moderate hydrogen leakage in a typical two-vented fuel cell configuration, in International Conference on Hydrogen Safety (2021)
  15. M. Naumov et al., SIAM J. Sci. Comput. 37, S602 (2015)
  16. rocALUTION, Library website, https://rocm.docs.amd.com/projects/rocALUTION/en/latest/
  17. Adastra, Cluster website, https://www.cines.fr/calcul/adastra/
  18. SFINAE, Substitution Failure Is Not An Error, https://en.cppreference.com/w/cpp/language/sfinae
  19. PDI, Introduction to the Portable Data Interface, https://www.eocoe.eu/video_resource/pdi-introduction-to-the-portable-data-interface/
  20. F. Barre, M. Bernard, Nucl. Eng. Des. 124, 257 (1990)
  21. B. Cariteau, I. Tkatschenko, Int. J. Hydrogen Energy 37, 17400 (2012)
  22. B. Cariteau, I. Tkatschenko, Int. J. Hydrogen Energy 38, 8030 (2013)
  23. B. Fuster et al., Int. J. Hydrogen Energy 42, 7600 (2017)
  24. G. Bernard-Michel, D. Houssin-Agbomson, Int. J. Hydrogen Energy 42, 7542 (2017)
  25. S. Hamimid, M. Guellal, M. Bouafia, Thermal Sci. 20, 1509 (2016)
  26. B. Müller, Lecture Series, von Karman Institute for Fluid Dynamics 3, E1 (1999)
  27. G. Bernard-Michel, B. Cariteau, J. Ni, S. Jallais, E. Vyazmina, D. Melideo, D. Baraldi, A. Venetsanos, in Proceedings of ICHS 2013 (2013)
  28. S. Giannissi, V. Shentsov, D. Melideo, B. Cariteau, D. Baraldi, A. Venetsanos, V. Molkov, Int. J. Hydrogen Energy 40, 2415 (2015)
  29. D. Demidov, Lobachevskii J. Math. 40, 535 (2019)
  30. R. Chandra, Parallel Programming in OpenMP (Morgan Kaufmann, 2001)
  31. C.R. Trott et al., IEEE Trans. Parallel Distrib. Syst. 33, 805 (2022)
  32. P.-Y. Chuang, L.A. Barba, AmgXWrapper: An interface between PETSc and the NVIDIA AmgX library, J. Open Source Software 2, 16 (2017)
  33. R. Boisvert, R. Pozo, K. Remington, B. Miller, R. Lipman, Matrix Market (National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, 2004), http://math.nist.gov/MatrixMarket
  34. CCRT, Cluster website, https://www-ccrt.cea.fr/
  35. R.D. Falgout, U.M. Yang, Hypre: A library of high performance preconditioners, in International Conference on Computational Science (Springer, 2002), pp. 632–641
  36. NVIDIA, NVIDIA Nsight Compute, https://developer.nvidia.com/nsight-compute

Cite this article as: Elie Saikali, Adrien Bruneton, Pierre Ledac, Remi Bourgeois. TRUST: the HPC open-source CFD platform – from CPU to GPU, EPJ Nuclear Sci. Technol. 11, 53 (2025). https://doi.org/10.1051/epjn/2025050

