Artificial intelligence: Deep learning in oncological radiomics and challenges of interpretability and data harmonization

may help classification and prediction in clinical practice. However, most of them are using limited datasets and lack generalized applicability. In this study we review the basics of radiomics feature extraction, DNNs in image analysis, and major interpretability methods that help enable explainable AI. Furthermore, we discuss the crucial requirement of multicenter recruitment of large datasets, increasing the biomarkers variability, so as to establish the potential clinical value of radiomics and the development of robust explainable AI models.


AI in oncology
Healthcare is expected to be highly impacted by machine learning (ML)-based artificial intelligence (AI). As deep learning (DL) relying on neural networks trained with large datasets has demonstrated state-ofthe-art performances in numerous applications, massive structural changes in information and data processing in this sector are expected. Oncology is especially targeted by these developments, cancer being a major worldwide issue (18.1 million cases and 9.6 million deaths in 2018, respectively 22 and 13 million projected for 2030) [1]. Regarding predictive modeling based on multimodal medical imaging such as CT (computed tomography), PET/CT (positron emission tomography / CT) or MRI (magnetic resonance imaging), both academic and private research rely on ML/DL methods, however their clinical implementation and acceptability are currently lacking.
For decades, in medical oncology, patients suffering from cancer underwent diagnosis imaging acquisitions including PET/CT/MRI, where anatomical and functional information were combined to provide prognosis of the disease and an effective treatment plan. The extensive use of advanced hybrid imaging scanners increased the amount of diagnostic data in daily routine, enhancing the need of computational support for fast and accurate diagnosis [2]. Daily clinical applications seem to take more and more advantage of the rapid developments of AI alongside the evolution of computer science. Applications of medical imaging in oncology and image-guided radiotherapy include early diagnosis, staging, treatment decision and planning, monitoring, and patient follow-up [3]. A patient's management may be optimized based on predictive models which are able to identify patients at risk of future treatment failures and recurrence. As some patients do not respond fully to the standard of care, different therapeutic strategies could be established based on these predictions. In addition, it is crucial to integrate data from several sources (clinical, imaging, dosimetry, genetics, toxicity, etc.) to improve predictive ability [4].
Besides automation in different stages of image processing, ML/DL opened a new era in clinical oncology, providing a more exhaustive and fast extraction of features from the diagnostic data, including some that may not be directly captured by the naked eye, including the expertly trained one. Quantification analysis of such features alongside with the combination of conventional anatomical and functional characteristics could further characterize tumors' profiles such as aggressiveness or potential of response to therapy, thereby informing clinical decision [5,6]. Radiomics and biomarkers selection and quantification are strongly interdependent with advanced ML/DL algorithms, which should be carefully used and extensively evaluated before being deployed in clinical practice. There are still several limitations and challenges to be addressed in the clinical application of AI in oncology, including the explainability and interpretability of the models, the sensitivity of the features' extraction, the reproducibility of the quantitative feature selection and the harmonization of the data.

AI approaches using oncological biomarkers and radiomics
On the one hand, radiomics has been introduced as the highthroughput extraction of "engineered" (or "handcrafted") features from medical images [6]. It has the potential to provide a quantitative signature of tumors' characteristics that cannot be appreciated visually [7] and has shown promising results in identifying tumor subtypes and in predicting outcome [8] by relying on ML methods to exploit radiomic features in combination with clinical or other variables to build predictive models. The majority of radiomics studies have been focused on oncology applications. On the other hand, the use of DL and specifically convolutional neural networks (CNNs) in computer vision have led to state-of-the-art results in filtering, segmentation, classification, and synthesis (image-to-image translation) including for medical images [9]. For these applications, the amount of available data (i.e. labeled pixels/ voxels in the case of segmentation, filtering, synthesis) is usually sufficient for training deep networks. On the other hand, the attempts for predictive modeling in radiomics [10][11][12] where labels are on a patient basis (i.e., one label per 3D image volume, instead of per pixel/voxel) did not lead to very large improvements compared to the standard radiomics approach, showing in some instances similar but complementary predictive power (i.e., combining both approaches leads to better results), given the comparatively smaller amount of available training samples (for instance, several hundreds of patients in radiomics studies versus millions of images in ImageNet [13]). Nonetheless, the current research trends are clearly to rely more on DL-based techniques, as they may allow for a higher level of automation compared to the traditional workflow and may therefore facilitate its clinical translation.

Interpretability of radiomics
There is a plethora of research and review studies investigating the extraction of radiomics and the optimal combination with other diagnosis biomarkers, to be used in clinical applications. However, there is a clear limitation on the translation of such procedures in oncology practice and their explainability in terms of clinical routine. The majority of the available studies lack concrete, reproducible results, applicable to a larger set of applications and differentiated data [14]. There is a big challenge in the scientific community to translate and effectively use the multi-parametric models combining advanced mathematical models with numerous variables of clinically derived biomarkers [15,16]. In several studies there is the attempt to apply ML/ DL methods in clinical routine applications. Such applications are described in Section 3.3.
Particularly in the context of DL the decision-making process of models is not transparent to humans and therefore interpretability is a crucial issue, especially in a potentially high-risk field such as radiomics.
Advantages of an interpretable model are a raised confidence that the model will behave in the expected way when presented with unseen data and also a higher trust and acceptance of models by end users, e.g. physicians. Interpretability is therefore an important challenge that needs to be tackled in order to facilitate clinical implementation of DL models.
In this review we present an overview of the state-of-the-art of Deep Neural Networks (DNNs) on oncological applications, using radiomics. There is a focus on the latest developments and the future perspectives regarding interpretability and harmonization of imaging biomarkers. In Section 2 we present the main radiomics classification with their definition and also take a brief look at the different approaches of extracting radiomic features from medical images. In Section 3 the focus shifts on DNNs, first explaining in general the architecture of neural networks, multilayer perceptrons (MLPs) and CNNs, which are widely used in medical imaging. In Section 3.3 we present several clinical applications of DNNs in oncology, highlighting their advantages as well as possible drawbacks. The two final sections are concerned with two of the major challenges to the clinical application of DNNs, namely model interpretability and multicenter harmonization. In Section 4 the subject of Explainable Artificial Intelligence (XAI) is introduced through three major interpretability methods. Finally, in Section 5 are presented methods of processing imaging data for use in AI models that tackle issues associated with data curation, medical confidentiality, multicenter harmonization, expanding datasets and model generalization.

Feature based radiomics
Conventional radiomic approaches are usually known as featurebased radiomics, which are automatically or semi-automatically derived from medical images. Some of these features aim to the maximum exploitation of available diagnostic clinical data, by uncovering "hidden", difficult or impossible to appreciate with the naked eye, features for clinical use.
The standard approach to extract radiomics features requires the definition of the Volume or Region Of Interest (VOIs/ROIs) in the applied images. There are recent studies, showing the enhanced quality of information derived when hybrid imaging data (PET/MRI, PET/CT) is used in contrast to the use of each one modality alone [17,18]. In order to enable high reproducibility and interpretability of radiomics, a welldefined processing procedure of the data is required prior to the calculation of the handcrafted features themselves. Such processes are analytically described in Section 4. There are a large number of features (even more than 1000), based on mathematical models, usually considered in radiomic studies and they can be categorized into 4 main groups [2]: 1. Shape features [19]: provide quantitative description of geometric properties of the ROIs/VOIs, such as surface area, total volume, diameter, sphericity or surface-to-volume ratio. 2. First order statistics (histogram-based features): describe the fractional volume for the selected region of voxels and the distribution of the voxels' intensity, for example minimum, maximum, mean, variance, skewness, or kurtosis. 3. Second order statistics (textural features): These features are extracted based on matrices derived from intensity relationships of neighboring voxels in a 3D image [20], such as: a. Gray Level Co-occurrence Matrix (GLCM): describes the spatial distribution of gray level intensities within a 3D image [21].
b. Gray Level Run Length Matrix (GLRLM): is defined as the number of contiguous voxels that have the same gray level value and it characterizes the gray level run lengths of different gray level intensities in any direction [22]. c. Gray Level Size Zone Matrix (GLSZM): quantifies gray level zones, the number of connected voxels that share the same gray level intensity, in a 3D image [23]. d. Neighbouring Gray Tone Difference Matrix (NGTMD): quantifies the difference between a gray value and the average gray value of its neighbours within a distance δ [24]. e. Gray Level Dependence Matrix (GLDM): quantifies the number of connected voxels within a distance δ that are dependent on the center voxel [25].
Second order features include entropy, uniformity, contrast, homogeneity, dissimilarity and correlation.
4. Higher order statistics features: These features are obtained by statistical methods after applying filters or mathematical transformations to the image, in order to highlight repeating patterns, edges, histogram-oriented gradients, or local binary patterns of the segmentation. These include fractal analysis, Minkowski functionals, wavelet and Fourier transformations, as well as Laplacian transformations of Gaussian-filtered images, which can extract areas with increasingly coarse texture patterns [26].

Deep learning radiomics (DLR) features
Deep learning based radiomic (DLR) features are obtained by normalizing the information from deep neural networks, especially CNNs, designed for image segmentation. The main hypothesis here is that once the image has been segmented accurately by a DNN, DNND the information about the segmented region is already stored within the network [27].
The first layer of an image processing DNN, whose architecture is described in detail in Sections 3.1 and 3.2, generally implements nonlinear template-matching at a relatively fine spatial resolution, extracting basic features of the data, thus detecting primitive patterns such as lines and edges. Subsequent layers learn to recognize particular spatial combinations of previous features, generating patterns of patterns in a hierarchical manner [28]. The higher layers of a deep neural network can often produce higher level features, which when the deep neural network's input is a medical image can be similar to the handcrafted radiomics features. These deep learning based radiomics features can be extracted from the last layers of the network. In this way a DNN can be used to convert 3D images into 1D vectors to allow medical image processing through deep learning, i.e. in an end-to-end fashion, or conventional machine learning methods.
The effectiveness of the deep learning radiomics features is highly related to the quality of the segmentation and the volume of the training dataset [29]. Therefore, in contrast to feature-based radiomics, large datasets are necessary to identify a relevant and robust feature subset. One other limitation of deep learning-based radiomics is the high correlation between the features and the input data, as the DLR features are generated from that very data without the application of prior knowledge [2].

Deep learning
Conventional machine learning had limited success in translating radiomic features into improving classification and prediction of cancer in clinical settings. Recently, deep learning has shown great potentials to improve feature engineering, classification, and prediction in medical imaging [9]. In this section, we review fundamentals of DNNs and CNNs.

Neural networks and multilayer perceptron (MLP)
To classify and predict clinical outcomes, supervised learning algorithms are trained on explanatory variables (e.g., input features) and response variables (e.g., output labels). In radiomics, classification tasks include diagnosis or prediction of response to therapy (e.g., benign vs. malignant lesions, responders vs. non responders to chemoradiotherapy), whereas regression tasks include time-to-event prediction (e.g., disease-free survival).
Generally, deep learning models consist of layers of connected neurons ( Fig. 1), where the single neurons are defined through simple activation functions. By combining a large number of nodes and layers, deep learning can learn complex and nonlinear functions between input features and output labels, achieving high performance in a variety of computer vision problems [30,31]. In a MLP, input features (such as medical images) are trained against output labels, while adjusting parameters to maximize prediction accuracy (Fig. 1).
The network transforms an input into an output by a process called forward pass which consists in taking, in each layer, a weighted sum of inputs (resulting in z) and applying an activation function (f), usually the logistic function f(x) = 1/(1 +exp( − x)) or the rectified linear unit (ReLU) f(x) = max(0,x). The purpose of such an architecture is to find a (non-linear) combination of the input features such that the classes in consideration become linearly separable [32].
A hidden layer(s) is essentially performing automated feature engineering, which finds informative combinations of input features. In conventional radiomics, the process of finding suitable combinations of input features has to be performed manually which is referred to as feature engineering. Handcrafted features are derived using expert knowledge and some of them could be highly informative of cancer, whereas others could be irrelevant. An arduous process for feature evaluation and selection is therefore needed to obtain accurate models. The introduction of a hidden layer, or several hidden layers in case of a deep architecture, automates this process, by using an iterative process of feeding labeled data into the network and updating parameters (weights and biases) in a process called backpropagation [33]. Hence the network learns directly from the data which features are relevant for the task at hand.
In practice, one commonly updates the weights using stochastic gradient descent [34], which uses an estimate obtained from a randomly chosen subset of the training dataset. Updating of the weights is repeated for many subsets until the loss function is not decreasing anymore and the model has converged. The performance of the trained model is evaluated on a test dataset, which has been held out from training. If the performance of the model is significantly worse in a test dataset than on a training dataset, it may suggest overfitting, where the model has adjusted to inconsequential peculiarities during training and does not generalize well beyond this particular training dataset. While we may attempt to reduce the model complexity in a conventional overfitting context, deep learning takes advantage of over-parameterized regimes [35]. In over-parameterized deep learning models, one may combat overfitting with data augmentation [36], and weight regularization [37 38]. Additionally, cross validation could be used to select the best performing model out of multiple models under consideration [39].

Convolutional neural networks (CNN)
Multilayer perceptrons are not well suited to classify image data. First, the array representing the image has to be flattened into a onedimensional input vector, removing spatial structures. Second, the MLP is not shift invariant such that a displacement of an input image fails the trained classification task. CNNs [40,41] overcome these challenges, accepting and being robust against shift of images or objects (Fig. 2).
A CNN typically consists of (a) convolutional layers that perform feature extraction, that are connected to (b) a MLP whose labels are response variables. The convolutional layers are organized in feature maps whose units are related to local patches of the previous layer through a small array of weights called a kernel or a filter. The value of each unit is obtained by calculating the weighted sum of activations of the previous layer using the kernel and applying an activation function. The process of obtaining the feature map is referred to as a discrete convolution of the kernel and the previous layer, hence the name. In a simplified form, it can be written as where f is the activation function, w is the kernel, z l− 1 is the feature map in the previous layer and b is the bias. Intuitively, the convolution can be understood as scanning the image with a kernel and storing, for each position of the kernel, the result in the feature map. The weights are shared within each feature map, resulting in shift invariance and reduction of parameters.
DL models with CNNs typically learn a hierarchy of features, where higher-level features are composed of lower-level ones. As an illustration, the first layers may learn edges, which are then combined to shapes and parts which comprise the objects to be classified. This composition of features explains why it is crucial to downsample the image or feature maps via pooling [42] or larger strides [43], because in this way kernels in the deeper layers "see" a larger portion of the original image. The training of the network can be performed just in the same way as MLP, namely by using backpropagation of the loss to update the weights.
CNNs have been hugely successful in computer vision and excel at classification [31,44], object detection [45,46], and segmentation [47,48]. They have also been used in other fields such as speech recognition [49] and natural language processing [50].

Applications of deep learning in medical imaging
In recent years there has been a surge of applications of deep learning techniques in medical image analysis (see in-depth reviews in [9,51]). In many cases the proposed models perform as well or even outperform health-care professionals, for example in the classification of diseases [52]. Here we review selected applications structured by the tasks which they perform, namely classification, detection, segmentation and registration. Table 1 summarizes the several applications incorporating the performed techniques.

Classification
The problem of classification of medical images can be divided into two subproblems [9]: image/exam classification and object/lesion classification. Image classification considers an image as a whole to predict a diagnostic output, e.g. presence of a certain disease. Object classification on the other hand is concerned with the classification of predefined patches of an image, e.g. whether a nodule is benign or cancerous.
In image classification, especially in medical imaging, transfer learning is a very popular approach due to the comparatively small number of available images for a given task. Transfer learning uses the convolutional layers of a classifier previously trained on a different dataset as a feature extractor which especially for small datasets leads to improved accuracy. This approach has been successfully applied, for example, in the classification of skin cancer [53] and diabetic retinopathy [54] with accuracy comparable to human experts.
Object classification is more involved in the sense that it requires global information about the location of the object as well as local information about the object itself. For this reason, pretrained networks can not so easily be utilized and a so-called multi stream architecture is a popular approach. In [55] several CNNs are trained on different scales of nodule patches and the extracted features are combined and fed into the MLP and [56] uses a similar approach but considers multiple resolutions instead of scales.

Detection
In computer vision object detection seeks to locate and identify instances from a predefined number of classes in images, where usually the location of the objects is indicated by rectangular bounding boxes. Specifically, in medical image analysis one commonly differentiates the tasks of localization of anatomical structures and detection of objects and lesions.
Most approaches to identify anatomical structures in 3D images translate the problem into a 2D classification problem. The basic idea is to first train a CNN on orthogonal slices of the 3D volume to classify the presence of a certain structure and to subsequently obtain the Fig. 2. Typical architecture of a CNN. In the first stage feature extraction is performed using convolutional and pooling layers, typically there are several such layers connected to each other which makes the network "deep". The second stage consists of a MLP which is using the extracted features to perform class predictions. localization of it by calculating the intersection of slices which have been predicted to obtain the structure. This approach has been successfully applied to automatically localize landmarks on the distal femur [57] and the heart, aortic arch and descending aorta [58].
In order to perform object or lesion detection many authors perform pixel wise classification, which is usually obtained through a sliding window technique [59]. Intuitively, the idea is to train a classifier on small patches of images and to obtain pixel-wise predictions by classifying the patch around the pixel. Since a convolution also consists of sliding windows (kernels) this approach can be performed very efficiently for CNNs by turning a classifier into a fully convolutional network [59]. Two selected applications of this technique are in histopathological diagnosis [60] and coronary calcium scoring [61]. In a recent study, 3D convolutional neural network (DeepMedic), was applied and evaluated to both detect and segment brain metastasis on MRI data [62]. Accordingly E. Grovik et al. [63], used a DL approach based on a fully-CNN, to demonstrate automated detection and segmentation of brain metastases on multisequence MRI data.

Segmentation
The purpose of medical image segmentation is to find structures of interest, such as tumors and lesions, and marking the constituting pixels with the same label. Deep learning techniques have proven to be very effective in this task and segmentation is in fact the problem which is most commonly tackled using CNNs [9].
The most well-known CNN architecture used for segmentation for medical images is U-net [64], which uses upsampling convolutional layers to obtain segmentation maps with the same resolution as the input. This architecture allows training the model using entire images end-to-end, which allows the model to utilize the whole context of the image. There exist several variants of U-net, most notably ones that allow processing of 3D images [65,66]. The segmentation of lesions requires to combine models for object detection and segmentation and has been successfully implemented in [67].
M. Soltaninejad et al. [68] investigated a supervised learning based multimodal MRI brain tumour segmentation technique using textural features from supervoxels in a limited number of clinical datasets, concluding that increased number of data could provide higher accuracy in the segmentation process. In addition, W. Deng et al. [69] developed a brain tumor segmentation method integrating fully convolutional neural networks and dense micro-block difference features and compared their results with traditional MRI brain tumor segmentation techniques. The study used BRATS 2015 (Brain tumor image segmentation benchmark) and the training of the algorithms was based on 100 patients with MRI brain tumor data. Another recent study was evaluated using BRATS for the performance of the detection of tumor regions in Glioma brain data. Features extraction applied and were used for training applying an Adaptive Neuro Fuzzy Inference system (ANFIS) approach for the classification of a brain image into a healthy or an abnormal -Glioma -brain image [70]. Recently, DNNs were applied in automatic segmentation of brain metastases. A dataset of ~500 imaging data were used for the evaluation of the method, resulting in sensitivity and specificity which varied according to the size of the lesions.

Registration
The registration of medical images seeks to align images by finding appropriate coordinate transformations that maximize a certain similarity measure.
Simonowsky et al. [71] use CNNs to construct such a similarity measure for two patches from different modalities. Using this measure, they are also able to derive optimized transformation parameters to spatially align the patches. In order to perform a 3D model to 2D X-ray registration Miao et al. [72] use CNNs to directly learn the transformation by training the network using artificial examples which have been obtained by manually adjusting the transformation parameters. DL approaches are extensively under investigation on lung radiotherapy applications. M. Foote et al. [73], designed a patient-specific motion subspace and a DNN to recover anatomical positions to define the 2D-3D deformation of the lungs. In addition, a recent study investigated the development and validated a robust and accurate registration pipeline for automatic contouring for online adaptive Intensity-Modulated Proton therapy (IMPT) for prostate cancer applications [74]. There are a plethora of registration applications in medical imaging utilizing DNNs [75].

Radiomics
The radiomics community has started relying on DL techniques, to address some of the remaining challenges and limitations of the usual radiomics workflow [76,77]. This includes automation of the detection and segmentation steps, as well as harmonizing images through synthesis generative methods (see Section 5.3). Some studies have also explored relying on one or several deep networks to achieve predictions either by extracting features (that are subsequently combined through standard machine learning techniques) or as an end-to-end tool up to the prediction task [12,27,[78][79][80][81][82]. Indeed, training a deep network from scratch on a limited size dataset can often be less efficient. One can thus extract "deep features'' from images using pre-trained networks. These "rough" to "fine" features at different scales through different layers can be exploited directly as well as combined with other handcrafted radiomic features to build even more accurate models [10,[83][84][85] as shown in some studies listed in the non-exhaustive Table 2 below. However, relying on DL methods in radiomics also requires addressing new challenges and facing several issues. These include the need for appropriate training with data augmentation techniques, constraints and prior knowledge due to the limited size of available datasets and their high level of heterogeneity, especially when training networks from scratch, as shown by some studies that achieved some success without having very large datasets to train their networks, as listed in the Table 3 below Another issue that has not yet been fully addressed even in recent studies is the lack of interpretability of the models built through the use of deep networks (see Section 4).

Explainable artificial intelligence (XAI)
The high performance of end-to-end deep neural networks comes at the cost of high complexity and vast number of parameters. We may not be able to understand and explain why a deep learning model has made certain classifications in image analysis. This type of algorithm is often referred to as a "black box", in which we cannot comprehend internal decision processes. The final outputs (e.g., classifications or statistics) are accepted without justifications.
There are several benefits to expect from improved explainability of radiomics models, especially if they relied on deep learning methods. First, specialists can better understand how the models they develop learn from data, which can allow them to improve the models, especially in understanding how they potentially fail in new data. Second, nonspecialists and especially end-users such as physicians, could better grasp the inner workings of the tools they rely on to make decisions for patients' management, which will increase their confidence in relying on them. In turn, confidence of patients in the tools will be increased if the physician can explain to them why he trusts the tool. Even though in principle one can follow every processing step, a huge number of parameters -e.g., the popular VGG-16 model has 138 million [86] -is making it infeasible to infer meaningful explanations of behaviors of the model in this way. Research in explainability and interpretability seeks to develop methods to reveal the behavior of a given model or to build models that are inherently more comprehensible for humans.
The concept of XAI is highly diverse, ranging from human computer interactions to visualization, and to interpretability metrics [87]. What it means for an algorithm to explain or how to evaluate interpretability are an active area of research and beyond the scope of this review. Instead, we focus on visual and statistical approaches that help us understand the application-based rationale behind deep learning models in the context of medical imaging. Understanding how exactly a model arrives at its predictions is important to ensure algorithmic fairness, identify a potential bias in a training dataset and to build trust that it performs on new data in an expected way [88]. Especially in sensitive fields such as radiomics, explainability is therefore a crucial criterion for widespread adoption. We summarize them in three categories:

Proxy models and model compression
Simpler and smaller models are more comprehensible as well as more efficient. Therefore, one may use more conventional statistical models to explain the operating characteristics of deep learning. A major challenge of interpreting deep neural networks is often raised due to the non-linearity of how input features are processed and incorporated into successive layers. Therefore, once deep neural networks are trained and Table 2 Some examples of studies comparing and combining a standard radiomics approach with a deep learning one (mostly extraction of "deep" features using pre-trained networks.

Study Application/ Endpoint
Image modality (ies)

Methods Conclusions
Paul et al. [85] Lung  demonstrate high performance, we can distill them into more conventional models [89]. Local Interpretable Model-agnostic Explanations (LIME) aims to explain a complex non-linear model by fitting a locally linear model in the vicinity of a certain prediction [90]. Beyond using a simpler proxy model merely for explanations, model compression seeks to capture the full spectrum (e.g., local and global) of accuracy while drastically reducing the number of parameters and complexities [91]. Particularly, Ba and Caruana [92] demonstrate that a shallow feed-forward net can learn the complex function previously learned by a deep model while maintaining accuracy. Hinton and Frost [93] devised a method to distill a deep learning model into a soft decision tree. In particular, they proposed to use predicted labels from a trained deep learning model, instead of a limited number of true labels, and to introduce adaptive penalties for regularization. They were able to build relatively compact decision trees with a slight reduction in prediction accuracy. Such soft decision trees can better represent a hierarchy of decisions that a human can interpret.

Visualization of intermediate features
Convolutional neural networks enabled high-performance deep learning in computer vision. For radiomics, convolutional layers can be seen as automated feature engineering that maximizes the prediction accuracy during training. Therefore, it is of great interest to identify which features have actually been learned by the convolutional layers. To this end, Olah et al. [94] proposed to perform a gradient ascent in the input space with respect to the activation of either a single unit in a feature map or for a whole feature map. Concretely, one starts with a pure noise as the input and iteratively changes its value in direction of the gradients in an optimization procedure. This leads to input images which maximally activate certain units or whole feature maps and therefore visualize what patterns the network is sensitive to.
Deconvolution [95,96] which is an inverse function of convolution, takes a different route to visualize learned features in convolutional layers. Essentially, once the model is trained, we set one of the output classes to one and other classes to zero and propagate through the network back to the input space. This backwards query maps the activation of the given output class back to the input and the resulting image can be understood as the network's internal representation of the output class [95,96]. Instead of starting from an output class, one may also arbitrarily start from an activation in any intermediate layer. The resulting image visualizes what shape or pattern this layer represents and is sensitive too.

Importance estimators and relevance scores
Input features, such as pixel or voxel values in medical images, are ultimately dictating classification. It is therefore of great interest to estimate the relative importance of input pixels for the classifications made by a model, i.e. to estimate which input pixels are the most relevant for a specific prediction. Because importance estimators can be visualized in the same dimensions as inputs, they are often referred to as saliency maps. There are two major approaches in which such a saliency map can be obtained. First, the perturbation methods measure the degradation of prediction accuracy, when small parts of the image are permuted, blurred, or generally perturbed [97][98][99].
Second, the gradient methods calculate the gradients of the class score with respect to the input pixels, where the class score is the activation of the neuron in the output vector corresponding to the class of interest [100,101]. There exist modifications of the standard method of gradient calculation via the chain rule. SmoothGrad [102] introduces imperceptible noises to the input image, which may result in more robust importance estimators. In Guided Backpropagation [43], negative gradients are set to 0, effectively discarding suppression of neuron activation. Rectified Gradient [103] generalizes this by allowing layerwise thresholding with an extra hyper parameter. Grad-CAM [104] calculates the gradient of the class score with respect to channels, i.e. feature maps, in the convolutional layers instead of the input pixels. Thus, instead of the importance of input pixels, rather the importance of high-level features learned by the intermediate layers is quantified. The resulting coarse saliency map can be upscaled to the input dimension and combined with aforementioned pixel-level fine grained saliency maps to obtain a high-resolution and class-discriminative importance estimator. Fig. 3 depicts a clinical example of tumors and gradient class activation maps (Grad-CAM) [12].
Due to a lack of ground truth and several related methods, one must be careful in using importance estimators. Particularly, many of the proposed importance estimators are motivated mainly by visual appeal, such as high contrast and reduced noise. Many of these de-noised saliency maps may result in strong biases that do not correspond to true interpretability of underlying deep learning models [105,106]. The degradation of prediction accuracy while masking important pixels has been used to evaluate saliency maps [107]. It has, however, been pointed out that the observed degradation is potentially not only due to removing important pixels but intertwined with a deviation from the distribution of natural input images [108].
Overall, these three major categories of interpretability methods are widely used in application of deep learning models, although they have not yet been extensively exploited to help explain DL based radiomics models. From simplifying complex models to visualizing features that are important for predictions, one should inspect and scrutinize models to better understand operating characteristics. Further development of XAI will likely contribute to facilitating clinical translation of deep learning based radiomics.

Data curation
A typical patient medical record today might have an abundance of information sourcing from a standard blood test, to more advanced imaging studies, i.e. Computed Tomography, etc., as well as various omics tests. From the advent of Computerized Tomography in the 1970 s, the amount of medical image data has been steadily increasing in the healthcare enterprise. A typical CT in the 1970 s contained ~ 40 5-mm slices, while today it can contain more than ~ 2 k 512 × 512 slices. Likewise, the various exams that a patient is prescribed have increased in information, complexity, come from various healthcare reference centers, and need to conform to the various guidelines and directives of the hospital and the National or International Healthcare System.
Although small datasets may suffice for the training of AI algorithms, large, well-curated datasets with associated annotations are deemed necessary for AI algorithms in the clinical setting [109]. To this direction the preparation and organization of data from various sources through their lifecycle, rendering them available for processing for research and/ or educational purposes is fundamental and is called data curation [110]. Data curation includes several steps like Ethical Approval, Deidentification, appropriate labeling and pre-processing, as well as specific dataset types.
The U.S. Health Insurance Portability and Accountability Act (HIPAA), and the E.U. General Data Protection Regulation (GDPR) require data de-identification of patient data. De-identification is the procedure of removing patient specific sensitive information, like name, address, contact information, to name just a few [111]. This type of identification information is present in various data, such as DICOM medical images. There are several toolkits that remove this sensitive information like Conquest DICOM software [112], RSNA Clinical Trial Processor (CTP) [113], K-Pacs [114], DICOM library [115], DICOMworks [116], PixelMed DICOMCleaner [117], DVTK DICOM anonymizer [118], YAKAMI DICOM tools [119], etc. Furthermore, they can opt to convert data to a different file format such as NIfTI (Neuroimaging Informatics Technology Initiative) [120] so as DICOM metadata sensitive information is removed, leaving only the image voxel size and patient position for the AI algorithm.
Currently developed AI instances are generally based on supervised learning approaches [121]. To this end, the (surrogate of) ground truth, which is usually an established diagnosis (e.g., based on biopsy) or a known outcome (response to therapy established after treatment, follow up registration of events such as recurrence or death), needs be linked to the image of the patient. After this procedure, which is called labeling, the AI algorithm can be trained and tested on datasets. Although supervised learning dominates the AI field, unsupervised and semisupervised learning can be used, especially when all or most of the data to be clustered/classified cannot be labelled using ground truth.
Another issue is that the dataset types may be coming from different manufacturers, vendors, institutions, countries (and thus different populations). If an AI algorithm is trained with data from a specific institution, on a specific vendor machine, and on a specific population, then the algorithm's performance might be overfitted to these and may not generalize well to other types of data. The AI algorithm should be thoroughly evaluated for generalization in other situations, where it might not work as efficiently or even completely fail. So, it is necessary that information used to train the AI algorithm come from various sources, or from a specific source if the algorithm is going to be deployed on a specific target population [122].

Multi-center-harmonization
Because most of previously published radiomics studies have been carried out using small, retrospective and monocenter cohorts of patients, the level of evidence regarding the potential added or complementary value of radiomics compared to clinical standard variables or simple metrics (such as in PET, metabolic volume and basic SUV max, peak or mean measurements) is considered to be rather weak [123,124]. In addition, the developed models are rarely tested on external datasets, even less often on several ones [125,126]. There is therefore a crucial need for the field to move away from the analysis of such small datasets, towards much larger ones in order to establish the potential clinical value of radiomics. With this need, comes the requirement of multicenter recruitment to achieve larger numbers. Another advantage of multicenter studies is the inherent variability in the data, which can make the cohort more representative and thus lead to more robust inference of models [127,128].
Collecting data from several centers is however complex for legal, ethical, administrative and technical reasons, although approaches such as distributed learning (a.k.a. federated learning in the machine learning community) consisting of the data not leaving the centers (only the models' parameters/weights are exchanged), can alleviate some of these issues [129]. Nonetheless, whether radiomic features are extracted from images stored locally in each center, or from images collected and stored in a centralized database, another major issue has to be considered. Indeed, most radiomic features have been shown to be highly sensitive to variations in several factors, including variability in scanner manufacturer, generation and models, acquisition protocols and reconstruction settings [130][131][132].
In some cases, factors involving relatively modest effects on the image characteristics from a visual point of view, can still have very important impact on some handcrafted radiomic features values and distributions (some being less robust than others to these effects). Thus, pooling these features together to perform any statistical analysis and build models can therefore lead to unreliable results, either hiding existing correlations or on the opposite, creating false discovery of correlative relationships [131,133]. Although it has not been extensively studied yet, using as input to CNNs PET/CT images with different characteristics and properties may also make the training of the network more complex or require more data than homogeneous datasets. On the other hand, a DL model could also benefit from heterogeneous data since it potentially leads to a model that is able to generalize better to unseen data. To understand the impact of harmonization in the context of DL therefore requires further investigation.
This variability in scanner manufacturers and models generations, acquisition protocols and reconstruction algorithms and settings are Fig. 3. Each row represents patients who did and did not developed distant metastasis in head and Neck Cancer. A) Raw images imported to the model, b) Gradient class activation map (Grad-CAM) of the penultimate convolution block, c) Merged image of columns a, and b. Red represents a region more significant to the designated classification (reproduced from the study of Diamant et al. [12]). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) currently a clinical practice reality, and will likely remain as such in the near future. In addition, one has to emphasize on the fact that this variability may also exist within a single center. For instance, when a PET/CT scanner is replaced by a newer model from the same manufacturer, or by another model by a different manufacturer, images before and after the replacement will likely have different characteristics and extracted features will exhibit some changes in response to those. Similarly, if the center has several scanners, differences in manufacturer / model / acquisition / reconstruction may also exist amongst patients of the cohort. Finally, a given scanner may be used differently by different radiologists/nuclear medicine physicians (favoring different reconstruction algorithms or settings for example). As such, there may be larger differences between images acquired within a given center using different scanners than between two centers relying on the exact same model and associated acquisition protocol and reconstruction settings. Therefore, the lack of harmonization procedures of images and / or radiomic features is also a potential limitation within a single center context. This has two important implications: i) first, when sharing image data for radiomic studies purposes, anonymization should be performed with caution, making sure that information relevant for the purpose of harmonization is kept in the DICOM files, such as for instance metadata about the scanner manufacturer and model, acquisition protocol (injected dose, etc.) as well as technical settings for the reconstruction (i. e., algorithm, implemented corrections, parameters, etc.); ii) when carrying out a radiomic analysis relying either on the extraction of handcrafted features or on the use of deep neural networks for feature extraction, metadata in DICOM files should be carefully checked to ensure proper data curation and extracting all the a priori knowledge about the acquisition and reconstruction of images in order to identify potential sources of bias and variability.
Taking these sources of variability into account is thus primordial for consistent and robust findings in any radiomics studies, even more so when multicenter data is considered. There exist a number of different approaches to address this issue, that can be classified in two groups: methods that address the issue in the image domain (i.e., before extracting features, either handcrafted ones or learning them directly from the images via a convolutional neural network) or in the features' domain (i.e., within or after the feature extraction step). On the one hand, addressing the issue in the image domain consists in harmonizing images directly so they have the same (or closer) properties (resolution, noise, texture, etc.). On the other hand, addressing it in the feature domain consists in harmonizing the features values, by either modifying how they are calculated (so they are less dependent on the varying factors in images) or directly modifying their distributions a posteriori so they can be pooled in the statistical analysis. Both approaches may also be combined, although this has not been extensively investigated yet. Most of the studies discussed below focus on one aspect or the other.

Standardization of imaging procedures
One way to reduce the variability of the image properties is to standardize the acquisition and reconstruction protocols to achieve more similar images, according to specific criteria. This is the case in PET/CT where guidelines have been specifically developed to achieve images with closer recovery coefficients and SUV measurements across scanners [134][135][136]. Indeed, these existing standardization guidelines are mainly focused on qualitative and basic quantitative measurements and do not specifically encompass radiomic features values and distributions as a criterion to aim for in achieving standardization. Although these long-lasting standardization efforts need to be consolidated and maybe expanded to better take into account radiomics, their ability to help in decreasing the variations in radiomic features distributions across different sites, may nonetheless remain insufficient to compensate for the existing (and here to stay) diversity of scanner models and manufacturers proprietary reconstruction algorithms and postprocessing tools across the various clinical centers. One recent study evaluated the performance of existing standardization guidelines for PET/CT imaging (i.e., EARL (European Association of Nuclear Medicine (EANM) Research Ltd. [135]) to reduce the variability of radiomic features across different scanner models and reconstruction settings [137]. They relied on a 3D printed phantom scanned on different systems across several centers. The differences between features extracted from PET images reconstructed with each clinical preferred setting and those extracted from the EARL-compliant reconstructions were important. A large percentage of radiomic features exhibited significant differences, even after standardizing the imaging procedures (acquisition protocols, reconstruction settings). This approach is feasible only for prospectively collected images, where it is allowed to modify the acquisition parameters. However, the majority of radiomics studies are retrospective [138]. Therefore, they are carried out by collecting images that have already been acquired and reconstructed. To evaluate the impact of different reconstructions, requires the storage of the raw data, which is rarely done in daily practice [139]. For retrospectively collected images, an approach that can work on already reconstructed images is therefore necessary.

Processing images
One approach is to apply image processing techniques before handcrafted feature extraction or analysis through a CNN. A common and popular example of such pre-processing is interpolation of all considered images to a common voxel size and applying filtering techniques so they would have similar resolution and noise characteristics. It is not trivial to implement, as there exist dozens of algorithms for image interpolation and filtering, so figuring out the most effective combination could be quite challenging and time-consuming. However, as isotropic voxels are recommended in the specific context of handcrafted textural features calculation by the IBSI (image biomarker standardization initiative) guidelines [16], interpolation to a common isotropic voxel size is often performed as a default pre-processing step in recent radiomic studies, so if images with variable reconstruction matrix sizes are considered, it can be beneficial also, although the choice of the common size parameter might be tricky. CNN also usually require images of identical size to be input in the network, so they are also interpolated before being fed to networks [140].
It has been suggested that interpolating images to a common voxel size for the purpose of harmonizing images and obtaining comparable (i. e. poolable in the statistical analysis) handcrafted radiomic features may be insufficient to fully remove the center effect [141]. Filtering images to achieve a similar spatial resolution may be quite detrimental in terms of textural analysis, if the common lowest denominator is chosen [142], which means higher resolution images are smoothed, hence removing details.
Another promising recently developed approach consists in relying on image synthesis through deep networks, such as GANs. The idea is to synthetize images with more similar properties for the specific goal of harmonization, so that handcrafted radiomic features extracted from harmonized images are comparable, or to facilitate training of deep neural network modeling. A recent work investigated the effect of different reconstruction kernels on radiomic features and evaluated the benefit on handcrafted features reproducibility to train a CNN to convert images from one reconstruction kernel to another, in a database of 104 lung cancer patients [143]. It demonstrated that different reconstruction kernels led to most of the features having significantly different distributions (595 out of 702), whereas after the proposed CNN-based image conversion, a larger percentage of features did not exhibit significant differences anymore (57%, 403 out of 702). Almost half of the features continued to exhibit differences, however. Another recent work relied on a two-step framework for multicenter image-based standardization using conditional generative adversarial networks (cGANs) to harmonize multicenter MRI brain images [144], while another relied on bi-directional translation between unpaired MRI images through a cycleconsistent GAN that uses 2 generator-discriminator pairs to achieve harmonization of DCE-MR images of breast [145]. A third study implemented a dual-GAN framework to harmonize diffusion tensor imaging (DTI) derived metrics on neonatal brains, demonstrating improved harmonization performance compared to standard approaches including voxel-wise scaling, and ComBat [146]. However, these studies did not extensively evaluate the resulting impact on multicenter radiomic studies. One recent work did so in the context of multicenter CT images, by relying on a GAN trained on different datasets to learn how to harmonize from one domain to another. Then a lasso classifier to stratify patients according to survival was trained using 77 radiomic features and evaluated in a cross-validation framework across the different domains [147]. Results show that relying on harmonized images to extract radiomic features improved the performance of the lasso classifier by an average of 11% in area under the receiver operating characteristic curve (from 3 to 32%).

Selection of features based on their reliability
One strategy consists in eliminating radiomic features because they have been identified to be unreliable (i.e., exhibit unreasonable variations in response to small variability of acquisition and reconstruction settings [130] or in a test-retest framework [148] before even considering them in any statistical analysis [149]. This can help build models with higher validation performance when tested on new data, as the features included in the model are expected to be robust to potential differences in image properties. Another advantage of this approach is that it reduces substantially the amount of variables to deal with in the modeling step, which can facilitate selection of features and building of multiparametric models. However, a drawback to consider is the potential loss of clinically-relevant information carried by the discarded features. One can only hope that the predictive power can still be found in the remaining features. In addition, the identification of the features that are both sufficiently reliable and carry enough clinically relevant information needs to be performed for each combination of clinical application and type of imaging data to be most appropriate for each case.

Modifying the feature's definitions
Because a number of radiomic features have been shown to be dependent on the number of voxels included in the calculation [150], it has been proposed to revise the feature definitions themselves to remove or reduce this dependency by including the number of voxels in the mathematical formulation [151]. Coincidentally, this can contribute to reduce the differences between radiomic features due to being extracted from images with different voxel sizes (and therefore different number of voxels for similar volumes of interest), as it has been shown in texture phantom data acquired in 8 different CT scanners from 3 different manufacturers [151], further validated in images of lung cancer [152].

Normalization
A large number of statistical methods have been proposed for statistical normalization [153].
A number of studies specifically evaluated the benefit of normalization techniques for the purpose of correcting biases and differences in radiomic features due to variations in imaging devices, acquisition protocols or reconstruction. A method for feature correction and bias reduction due to difference in exposure in the CT acquisition was proposed, by learning from phantom and clinical data how to model the differences, and then applying that learned correction to features values, thereby demonstrating at least 2 times standard deviation reduction for 47 out of 62 features [154]. Another recent work trained a deep neural network to standardize radiomic and "deep" features across scanners models and acquisition and reconstruction settings, relying on a publicly available texture phantom dataset [155]. It also showed the ability to transfer the learned standardization to new data coming from unknown scanners. The use of normalization to obtain more robust predictive radiomic models for validation in external data was demonstrated by normalizing features separately for each dataset rather than performing the normalization for all datasets combined [156]. Another study relied on z-score normalization to harmonize radiomic features extracted from pretreatment MRI for building a model predicting response in a multicenter study of 275 cervical cancer patients from 8 different centers [157]. High performance was obtained for the predictive models, although the study did not report performance without the normalization.

Batch effect removal
ComBat is designed to estimate a batch-specific transformation to express all data in a common space devoid of center effects [158] and has been shown to provide satisfactory results even for small datasets [159]. An extensive comparison of the previously described normalization techniques in the specific context of radiomics has not yet been carried out, although previous comparisons between ComBat and similar techniques for batch effect correction in different fields (including genomics) indicated superiority of ComBat. Recently, a study compared ComBat with SVD decomposition and voxel size resampling in the context of CT imaging using phantom data and a clinical cohort of patients with colorectal/renal cancer liver metastases [160]. The results indicated that the best harmonization was achieved with ComBat.
ComBat was first evaluated for harmonization of radiomic features in the context of PET [133] imaging and was later evaluated for CT [161] and MRI [162]. It has been exploited in a number of radiomic clinical studies to improve results of predictive models: in FDG PET and MRI radiomics for locally advanced cervical cancers (accuracy improved from 76 to 81% before harmonization to 81-97% after ComBat applied to the three centers) [163] and in FDG PET/CT for early-stage lung cancer where features had lower predictive power without harmonization, and ComBat allowed validating the model trained in 3 centers when applied to the fourth one [164]. Its benefit was also evaluated in the context of DCE MRI images of breast cancer to differentiate 3150 malignant and benign lesions, where classification performance using harmonized features was significantly higher (p < 0.001) [165]. The method was recently used in a multicenter CT study to harmonize radiomic features extracted from the different CT scanners in order to build reliable models predictive of outcome of COVID-19 patients [166]. Unfortunately, this study did not report performance of radiomic features without ComBat. Finally, in a recent study, radiomic models were trained to identify malignant nodules in early diagnosis of lung cancer with low-dose CT and externally validated [167]. All models had a high performance in the external validation set (AUC above 0.82), and this was not significantly altered when relying on ComBat-harmonized features.
Combat therefore seems a promising operational and simple method to perform harmonization of radiomic features, providing the number of labels is reasonable and the sources of variations can be identified and labeled. In case of very high heterogeneity where the number of labels to use with ComBat would be too high with respect to the number of patients, unsupervised clustering can be relied upon to identify potential labels to use for harmonization [168]. To avoid features to lose their physical meaning after harmonization, a variant of ComBat allowing to select a reference to align other labels to (instead of averaging all distribution to an arbitrary grand mean), name M− ComBat, can be used with no loss of performance. Finally, the robustness of the estimation can also be improved through bootstrapping and Monte Carlo (B-Combat) [168].

Discussion and conclusions
In modern radiation oncology, AI techniques have found several applications in many research domains, ranging from image processing for diagnosis to the optimization of precise therapeutic protocols. The exploitation of imaging biomarkers and radiomics features provided a new metric for quantitative image analysis, aiming to support clinical decisions, in detection, characterization and treatment planning on several pathologies. DNNs seem to provide the potential for a revolution in the field of medical imaging and radiotherapy, opening a new era on the personalization of diagnostic and therapeutic radiation protocols [169]. Although the rapid and increasing developments of DL approaches in imaging biomarkers, the state-of-the-art methodologies in radiomics provide several limitations and need to address great challenges in terms of explainability, interpretability and homogenized approaches (multi-center data harmonization).
Despite the developments of DL models in radiomics, it is still an issue the concept of understanding and explaining the way that the classifications and the predictions are done. The concept of XAI, the metrics used and the evaluation of interpretability is highly debated and under investigation in the scientific community. Furthermore, the repeatability, transferability and reproducibility of radiomics are of high interest as they are dependent on each specific imaging acquisition. In a recent study, the authors investigated imaging biomarker radiomics that seemed to be repeatable and reproducible within the reviewed studies [170]. In this framework, IBSI provided guidelines and radiomics nomenclatures and definitions, to support the verification of feature extraction in the field of radiomics [16].
DLR features may provide advantages in DNNs showing higher generalization and transferability compared to feature based radiomics. However, even the models developed for DLR, they still lack reliability and explainability for their application in clinical practice.
It is of crucial importance to put effort on the standardization of the models and generalization, applying harmonization methodologies to enable the understanding of several published results. Small datasets, dependency on image acquisition protocols (data analysis, imaging modality, quality of image, processing methods), however, are still difficult problems which complicate reaching this understanding. There are already several open-source software, available in the scientific community for radiomics research, such as, Keras [171], TensorFlow [172], LifeX [173], MaZda [174], PyTorch and PyRadiomics [175]. Even though the procedures applied and the workflow in such packages are not simple and lack generalization, with result not to allow researchers to fully comprehend the results and even more to reproduce them.
Last but not least, the major issue of radiomics, is their interpretability in clinical routine. Till now, most of radiomics extraction and imaging biomarkers analysis are used as "black box", making it impossible to clinically translate their usage [29].
It is highly encouraging that the aforementioned limitations are now well-known and are recently widely discussed in literature, moving the research on more focused studies, on addressing the challenges, and finding ways for the clinical exploitation of AI developments in the field of radiomics.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.