Automatic chest computed tomography image noise quantification using deep learning

Purpose: This study aimed to develop a deep learning (DL) method for noise quantification for clinical chest computed tomography (CT) images without the need for repeated scanning or homogeneous tissue regions. Methods: A comprehensive phantom CT dataset (three dose levels, six reconstruction methods, amounting to 9240 slices) was acquired and used to train a convolutional neural network (CNN) to output an estimate of local image noise standard deviations (SD) from a single CT scan input. The CNN model consisting of seven convolutional layers was trained on the phantom image dataset representing a range of scan parameters and was tested with phantom images acquired in a variety of different scan conditions, as well as publicly available chest CT images to produce clinical noise SD maps. Results: Noise SD maps predicted by the CNN agreed well with the ground truth both visually and numerically in the phantom dataset (errors of < 5 HU for most scan parameter combinations). In addition, the noise SD estimates obtained from clinical chest CT images were similar to running-average based reference estimates in areas without prominent tissue interfaces. Conclusions: Predicting local noise magnitudes without the need for repeated scans is feasible using DL. Our implementation trained with phantom data was successfully applied to open-source clinical data with heterogeneous tissue borders and textures. We suggest that automatic DL noise mapping from clinical patient images could be used as a tool for objective CT image quality estimation and protocol optimization.


Introduction
Computed tomography (CT) is one of the primary volumetric imaging modalities used in radiology. This popularity is based on continuous technical development and the versatility of clinical applications, offering fast scans with high spatial resolution and large anatomical coverage [1]. Owing to this, CT contributes approximately 70 % of the cumulative radiation exposure to patients in diagnostic imaging [2]. Concerns about the radiation dose burden have accelerated the technical development of optimization methods that enable CT scans with the most beneficial balance between image quality (IQ) and radiation dose [3,4]. Maintaining adequate IQ is an absolute requirement for securing reliable diagnostic information and providing correct care decisions targeted at the effective care of each individual patient [5].
Medical IQ is defined as the capability to provide accurate diagnostic (anatomical and functional) information [6]. To measure IQ, the traditional approach involves applying technical phantoms and standardized exposure conditions to determine image noise, contrast, spatial resolution, and their derivatives (e.g., contrast-to-noise ratio and detective quantum efficiency) [7]. Typical measurement conditions for these technical IQ evaluations use high radiation exposures, high-contrast targets with regular geometry, and standardized measurement setups that differ significantly from clinical exposure conditions and individual patient anatomy.
Clinical imaging, in contrast, usually involves a lower, optimized radiation dose and varied contrasts and anatomical structures. This has typically necessitated subjective IQ evaluation by a human observer (radiologist). Therefore, clinical IQ has been challenging to measure in an extensive, repeatable, and unambiguous manner. In recent years, radiomics feature analyses and model observers have been developed to make clinical IQ evaluations quantitative and objective [7][8][9][10]. Anthropomorphic phantoms have been used to improve on simplistic technical phantoms toward better resemblance to true patients. However, the ability of any artificial phantom model to represent the characteristics of a human being is limited. More specifically, these models do not provide a comprehensive surrogate for patients with anatomical and pathological variability, gender and age representations, tissue textures and compositions, and physiological motion [5,6], although elaborate virtual clinical trials are striving toward this level of detail [11].
To overcome these challenges, artificial intelligence, or more specifically deep learning (DL), has been proposed as a new and more flexible method for IQ assessment. DL could be used to extend IQ estimation from phantom measurements to automated assessment directly from clinical CT images. Clinical CT IQ estimation has primarily focused on noise and contrast estimation with traditional image processing methods [14][15][16]. However, recent developments have extended the available toolset to include DL methods. The strength of DL in medical imaging arises from the abundance of available image data for training and testing, the transferability of previously trained networks to other imaging tasks, and increasing access to open datasets [12,13]. Grant et al. demonstrated that a pre-trained VGG19 network can be used to characterize the diagnostic quality (radiologist's assessment) of clinical lung CT scans with 76 % accuracy [17]. In the study by Li et al., DL was used to estimate radiologists' subjective evaluations of IQ scores (image noise, artifacts, edge and structure, overall IQ, and tumor size and boundary estimation) on a five-point scale [18]. In the study by Lima et al., a technical phantom dataset from 43 CT devices was used to train a DL model (SqueezeNet), which performed low-contrast lesion detectability scoring with 96 % accuracy [19]. DL has also been applied for screening potential CT protocols for dose optimization [20]. Similar methods have also been developed for magnetic resonance imaging [21][22][23].
Noise removal is an active field of machine learning research [24]. In contrast, pure estimation of the noise magnitude has received less attention. The latter task can be assumed to be inherently easier because the aim is to estimate the (local) noise level instead of recognizing and separating the exact noise component from the data. In typical CT image noise measurements, the image intensity standard deviation (SD) is calculated either from a homogeneous phantom region or from a difference image from a stationary dual acquisition. These approaches may not be feasible for clinical images because large homogeneous regions or dual acquisitions are typically not available. Our recent research involved noise magnitude assessment by applying DL network models to head CT image data [25]. Our wider objective is to extend the DL-based noise assessment methodology to other anatomical regions and to further develop a tissue-specific IQ measurement framework for automated clinical IQ quantification to aid in quality control, optimization, and harmonization processes. This study applied DL-based noise estimation to a single clinical chest CT acquisition using a convolutional neural network (CNN) trained on dual-acquisition phantom data.

Model and training
An anthropomorphic dosimetry phantom (CIRS ATOM 702-D, Norfolk, USA) was scanned using a Revolution EVO (GE Healthcare, Boston, MA, USA) CT scanner. Acquisition parameters (pixel size = 0.53 mm, slice thickness = 0.625 mm, DFOV = 27 cm, rotation time = 0.5 s, pitch = 0.984, scan area encompassing 476 slices, automatic tube current modulation enabled) were kept constant apart from the tube peak kilovoltage (kV) and current, which were varied, resulting in five different scan protocols and three different volume CT dose index (CTDIvol) levels: 1 mGy (low dose; 100 kV and 120 kV), 5 mGy (standard clinical dose; 100 kV and 120 kV), and 10 mGy (high dose; 120 kV) (Table 1). Three consecutive repetitions of each protocol were scanned: two for computing the ground truth local noise maps and one for the independent CNN input. The resulting image datasets were reconstructed with filtered back projection (FBP) using "standard", "soft", and "bone plus" kernels, as well as with GE's adaptive statistical iterative reconstruction (ASIR-V, 20 % and 40 % weightings) and the TrueFidelity DL-based reconstruction [26].
For each reconstruction, two of the three repetitions were used to compute a local noise estimate as

SD = sqrt( U(d²) − U(d)² ),    (1)

where d = (x₁ − x₂)/√2 is the difference image of acquisitions x₁ and x₂, scaled with 1/√2 to make the SD values statistically correspond to a single image, and U(·) denotes average filtering using a [5 × 5 × 5] kernel. Estimates were computed within phantom bounds using Hounsfield unit (HU)-threshold-based masking with morphological opening, hole-filling, and retrieval of the largest connected component to leave air and the patient table out.
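The dual-acquisition noise map computation above can be sketched as follows; this is a minimal illustration assuming equation (1) amounts to a local variance estimate via moving-average filtering (the function name and the variance clipping are our additions):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_noise_sd(x1, x2, kernel=(5, 5, 5)):
    """Local noise SD map from two repeated acquisitions.

    The difference image d is scaled by 1/sqrt(2) so the SD values
    statistically correspond to a single acquisition; U(.) is a
    moving-average (uniform) filter with a 5 x 5 x 5 kernel.
    """
    d = (np.asarray(x1, dtype=np.float64) - np.asarray(x2, dtype=np.float64)) / np.sqrt(2.0)
    # local variance: U(d^2) - U(d)^2, clipped at zero for numerical safety
    var = uniform_filter(d ** 2, size=kernel) - uniform_filter(d, size=kernel) ** 2
    return np.sqrt(np.clip(var, 0.0, None))
```

Applied to two repetitions that differ only by independent noise of a given magnitude, the map should recover that magnitude locally.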
The 3D CNN architecture for estimating the noise maps consisted of seven convolutional layers (Fig. 2). The architecture was identical to that used in our previous study analyzing brain images [25]. Non-overlapping image patches of size 11 × 11 × 11 voxels were used as network input. Patches were extracted only from within the masks described above. The network output (single voxel) was compared with the corresponding voxel in the ground truth SD map and used for the mean-squared error (MSE) loss computation.
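A PyTorch sketch of such a seven-layer network is given below. The filter counts and the exact layer ordering are assumptions (the paper specifies them only in Fig. 2); the sketch uses five valid 3 × 3 × 3 convolutions to shrink an 11³ patch to a single voxel (11 → 9 → 7 → 5 → 3 → 1), followed by two 1 × 1 × 1 convolutions, matching the Conv3/BN/ReLU/Conv1 blocks named in the figure legend:

```python
import torch
import torch.nn as nn

class NoiseSDNet(nn.Module):
    """Illustrative 7-layer 3D CNN mapping an 11^3 patch to one voxel.

    Filter width and layer ordering are assumptions, not the authors'
    exact configuration.
    """
    def __init__(self, width=32):
        super().__init__()
        layers = []
        in_ch = 1
        for _ in range(5):  # Conv3 + BN + ReLU blocks; spatial size shrinks by 2 each
            layers += [nn.Conv3d(in_ch, width, kernel_size=3),
                       nn.BatchNorm3d(width),
                       nn.ReLU(inplace=True)]
            in_ch = width
        # two 1x1x1 convolutions produce the single-voxel noise SD estimate
        layers += [nn.Conv3d(width, width, kernel_size=1),
                   nn.ReLU(inplace=True),
                   nn.Conv3d(width, 1, kernel_size=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):  # x: (N, 1, 11, 11, 11)
        return self.net(x)
```

Training against the matching ground-truth voxel with `nn.MSELoss()` then corresponds to the loss described above.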
Data imaged at 120 kV were sequentially split into training, validation, and test sets. The phantom was divided into four slabs of 120 slices along the z-axis, and from each slab, 78, 16, and 16 slices (74, 16, and 16 in the last slab) were allocated to the training, validation, and test sets, respectively (Fig. 1). Five slices were excluded before and after the test slices to avoid correlation between the test and training data. The images acquired at 100 kV were used only for testing (omitting the training and validation slice locations shown in Fig. 1). Thus, the total data volumes were 9240, 1920, and 3840 slices for training, validation, and testing, respectively (not accounting for the triple acquisition).
The CNN was built with the PyTorch DL framework (v.1.8.1) and trained for 50 epochs using the Adam optimizer (with standard values for the decay rates β1 = 0.9 and β2 = 0.999), an initial learning rate of 0.0001, a cosine-annealing learning rate scheduler (one cycle over the epochs), and a batch size of 512. Input training data were scaled according to the mean and SD over the entire training input data, and the ground truth SD maps were normalized according to the minimum and maximum of the entire set of training ground truth SD maps. Validation and test data were similarly scaled using the values derived from the training data statistics, and the outputs were scaled back to correspond with the true SD values.

Testing
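The scaling scheme described above can be sketched as follows; the function names are illustrative, but the statistics used (training-set mean/SD for inputs, training-set min/max for targets) follow the text:

```python
import numpy as np

def fit_scalers(train_inputs, train_targets):
    """Collect training-set statistics: inputs are standardized with the
    global mean/SD; ground-truth SD maps are min-max normalized."""
    return {
        "mean": float(np.mean(train_inputs)),
        "std": float(np.std(train_inputs)),
        "tmin": float(np.min(train_targets)),
        "tmax": float(np.max(train_targets)),
    }

def scale_input(x, s):
    return (x - s["mean"]) / s["std"]

def scale_target(t, s):
    return (t - s["tmin"]) / (s["tmax"] - s["tmin"])

def unscale_output(y, s):
    # map network outputs back to HU-valued SD estimates
    return y * (s["tmax"] - s["tmin"]) + s["tmin"]
```

Validation and test data are passed through `scale_input` with the same training-set statistics, and `unscale_output` recovers SD values in HU.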
The performance of the model was tested by applying the trained CNN to compute noise maps for three different test sets: (1) the held-out test slices of the phantom dataset acquired with the GE Revolution EVO scanner, (2) phantom datasets acquired with Siemens SOMATOM Force and Canon Aquilion Prime scanners, and (3) clinical chest CT studies. The clinical contrast-enhanced cases were retrieved from the Radiological Society of North America Pulmonary Embolism Detection Challenge open dataset [27]. Furthermore, a non-contrast-enhanced CT volume was obtained from the COVID-19-CT dataset [28]. Clinical cases were chosen to test the model's generalizability and performance over varying scan parameters (Table 2). As scan repetitions were not available for the clinical data (i.e., ground truth noise SD maps could not be generated), comparative noise map estimates were computed by subtracting a 5-slice running average from each slice and computing the map according to equation (1), where in this case the difference image was scaled to make the SD values correspond to a single image, and U(·) again denotes average filtering using a 5 × 5 × 5 kernel. Furthermore, lung lobe segmentation was computed for each clinical CT study using the lungmask pre-trained segmentation network [29], and noise statistics were analyzed in five volumes of interest per study: left upper lobe, left lower lobe, right upper lobe, right middle lobe, and right lower lobe.
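A sketch of the running-average reference estimate is given below. The exact scaling constant used in the paper is not recoverable from the text; here we derive one under the stated assumption that the 5-slice average includes the current slice and the noise is i.i.d., in which case Var(x − RA(x)) = (1 − 1/n)·σ², so dividing by sqrt(1 − 1/n) restores single-image SD values:

```python
import numpy as np
from scipy.ndimage import uniform_filter, uniform_filter1d

def reference_noise_sd(volume, n_slices=5, kernel=(5, 5, 5)):
    """Single-scan reference noise SD map via running-average subtraction.

    The rescaling constant is our own derivation under the assumptions
    stated above, not necessarily the paper's exact choice.
    """
    v = np.asarray(volume, dtype=np.float64)
    ra = uniform_filter1d(v, size=n_slices, axis=0)   # 5-slice running average
    d = (v - ra) / np.sqrt(1.0 - 1.0 / n_slices)      # rescale to single-image SD
    var = uniform_filter(d ** 2, size=kernel) - uniform_filter(d, size=kernel) ** 2
    return np.sqrt(np.clip(var, 0.0, None))
```

On a volume of pure uncorrelated noise, the map should recover the imposed noise SD; on clinical data, tissue boundaries along the slice direction inflate the estimate, which motivates the percentile filtering described later.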

Statistical analysis
The inter-rater reliability between the CNN estimates and the ground truth values in the phantom tests was computed using the intraclass correlation coefficient ICC(3,1) metric (specific raters, mixed effects, and consistency), following the principles outlined in [30].
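For the two-way mixed-effects, single-rater, consistency definition, ICC(3,1) = (MS_R − MS_E) / (MS_R + (k − 1)·MS_E), which can be computed directly from the ANOVA mean squares. A minimal sketch, assuming the mean SD values are arranged as an (n_targets × k_raters) array with the ground truth and the CNN as the two "raters":

```python
import numpy as np

def icc3_1(ratings):
    """ICC(3,1): two-way mixed effects, single rater, consistency."""
    y = np.asarray(ratings, dtype=np.float64)
    n, k = y.shape
    grand = y.mean()
    row_means = y.mean(axis=1)   # per-target means
    col_means = y.mean(axis=0)   # per-rater means
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_err = np.sum((y - grand) ** 2) - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-targets mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse)
```

Because consistency ICC ignores a constant offset between raters, two ratings differing only by an additive shift yield ICC(3,1) = 1.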

Phantom datasets: Test set 1
Lung noise SD maps computed by the CNN were visually similar to the ground truth noise SD maps (Fig. 3). Most dissimilarities were observed at low noise levels (iterative or DL reconstruction), where tissue interfaces were enhanced. Furthermore, the mean absolute percentage error (MAPE) was below 17 % for all phantom datasets in test set 1 (Table 3). The largest MAPE values were observed with high-dose protocols and iterative or DL reconstructions, where the noise amplitude was low (Fig. 3, Table 3). Correspondingly, the lowest MAPE values were observed in high noise conditions, i.e., when the dose was low and sharper FBP kernels (such as the bone plus kernel) were used.
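The pixel-wise MAPE reported here can be sketched as follows (the function name and the optional mask argument are illustrative):

```python
import numpy as np

def mape(pred, gt, mask=None):
    """Mean absolute percentage error between a predicted and a ground
    truth noise SD map, pixel-wise, optionally restricted to a mask."""
    p = np.asarray(pred, dtype=np.float64)
    g = np.asarray(gt, dtype=np.float64)
    if mask is not None:
        p, g = p[mask], g[mask]
    return 100.0 * float(np.mean(np.abs(p - g) / np.abs(g)))
```

Note that when the ground truth SD approaches zero (low-noise, high-dose iterative or DL reconstructions), MAPE grows even for small absolute errors, consistent with the trend described above.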
Similar findings were present in the modified Bland-Altman plots of the phantom datasets, which feature voxel-wise comparisons of the difference in the SD value between the CNN output and the ground truth as a function of the ground truth SD value for a random 2000-voxel subset (Fig. 4). Contrary to the MAPE results, the absolute differences (ground truth value subtracted from CNN output) were lower in low noise conditions (high dose, iterative or DL reconstruction) and higher when noise was more dominant (low dose, sharp reconstruction kernels). A descending trend was observed in all datasets, showing that at low ground truth SDs the CNN tended to overestimate the prediction, and correspondingly at high ground truth SDs the CNN underestimated the prediction. Moreover, there appeared to be a structural component to the variance of the SD difference, as voxels closer to the isocenter had higher SD estimates than voxels closer to the periphery (Fig. 4). Furthermore, the peaks of the noise histograms obtained with both methods occurred close to each other, whereas the overall shapes of the histograms differed to varying extents in some reconstructions (Fig. 5). Finally, the mean noise SD values in the lung noise maps were in good agreement, showing absolute errors of less than 3 HU and relative errors of less than 10 % in all reconstructions (Table 3, Fig. 6A). The inter-rater agreement was excellent, with an ICC(3,1) of 0.999 between the mean SDs of the ground truth and CNN output.

Table 3
Descriptive metrics computed for test set 1. The mean standard deviation values in the ground truth and convolutional neural network (CNN) maps across the lung volumes of interest (VOIs), as well as their absolute and relative differences, are presented. In addition, the mean and standard deviation (STD) of the pixel-wise computed mean absolute percentage errors (MAPEs) are presented.

Phantom datasets: Test set 2
Similar Bland-Altman analysis showed that errors in the estimated mean noise values were slightly higher in the test data obtained with different scanners; however, most reconstructions exhibited less than 5 HU error (Fig. 6B). The inter-rater reliability was also excellent in this dataset, with an ICC(3,1) of 0.997 between the mean SDs of the ground truth and CNN output.

Clinical dataset: Test set 3
CNN-based noise SD map estimates had less visible anatomical structure outlines, whereas the running-average based reference method overestimated the structural variance because consecutive slices were averaged in the noise estimate (Fig. 7). One of the selected cases (a non-contrast-enhanced exam) was noisier than the others but did not exhibit as large an edge enhancement in the vessels, as no contrast agent was used. As seen from the example segmentation (Fig. 7, top row), the open-source lung lobe segmentation network reliably found the lung lobes that were used for the numerical SD comparisons (Fig. 8). In the numerical analysis, the lung mask of the contrast-enhanced cases was filtered to remove noise SD values greater than the 70th percentile of the reference noise SD map (determined empirically) to remove the bias generated by the enhanced edges. This resulted in a more realistic estimate with the running-average-based reference method while not notably changing the statistics of the CNN output. This additional masking was not required in the non-contrast-enhanced case. In general, the CNN estimate resulted in statistics similar to those of the reference method in the selected cases and lung lobes (Fig. 8).
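The percentile-based edge filtering described above can be sketched as follows (the function name and the boolean-mask interface are our assumptions):

```python
import numpy as np

def filter_mask_by_percentile(mask, ref_sd, pct=70):
    """Restrict a boolean lung mask to voxels whose reference noise SD
    is at or below the given percentile (computed within the mask),
    suppressing contrast-enhanced edge voxels."""
    thr = np.percentile(ref_sd[mask], pct)
    return mask & (ref_sd <= thr)
```

Applied with `pct=70`, roughly 70 % of the masked voxels are retained, removing the highest-SD voxels that are dominated by enhanced airway and vessel edges.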
Finally, the linearity of the model response to additive noise was analyzed by adding varying amounts of noise to a clinical image. The noise images were created by subtracting two (stationary repeated) phantom scans from each other and scaling with different weighting factors. In this coarse linearity test, the average of the CNN output inside the lungs showed a linear response with increasing noise (i.e., increasing weighting factor) (Fig. 9).
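The expected behavior in this test follows from Var(x + sqrt(w)·d) = Var(x) + w·Var(d) for noise d independent of the image x, so the measured noise variance should grow linearly in the weight w. A minimal sketch of this check (function name illustrative, with directly measured variances standing in for the CNN output):

```python
import numpy as np

def added_noise_variances(image, noise_image, weights):
    """For each weight w, impose sqrt(w)-scaled noise on the image and
    measure the resulting variance, which should be linear in w."""
    x = np.asarray(image, dtype=np.float64)
    d = np.asarray(noise_image, dtype=np.float64)
    return [float(np.var(x + np.sqrt(w) * d)) for w in weights]
```

Fitting a line to the returned variances against the weights should recover a slope close to the variance of the imposed noise image, mirroring the linear trend in Fig. 9.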

Discussion
In this study, a DL-based image noise estimator operating on a single clinical chest CT scan was developed. Further analysis using automated lung segmentation provided targeted and automated noise estimation. Clinical chest CT image data represent highly variable characteristics of individual patients with varying body shape and thickness, different organ morphology and contrast, and individual pathological stages. Furthermore, the image characteristics varied because of the different CT scanner models (providing vendor-specific reconstructions), which represent the variability of clinical scan protocols in individual anatomical sites. These facts support the notion that quality control metrics retrieved from actual clinical data are preferable to simplified phantom measurements.
Our proposed method was able to estimate the noise magnitude in clinical patient images without the repeated scanning required by conventional noise map computation, which would not be feasible in patients because of the additional radiation exposure. This approach opens new possibilities in CT quality control, including the harmonization of scan protocols, especially across large sites, and the optimization of IQ for different patient demographics.
Automated organ segmentation is beneficial for patient-specific IQ and dosimetry-related tasks when developing optimization and quality monitoring methods for medical imaging. Segmentation enables results to be targeted to clinically relevant organ regions, which is also of interest in traditional subjective radiologists' reviews of medical images. In our study, we used automated segmentation as a basic building block to adjoin our voxel-based noise results to organ regions, allowing even more targeted measurements of IQ.
Our results showed that DL-based noise estimation could produce noise SD maps with minimal errors compared with a gold standard estimate obtained with two scan repetitions in the phantom datasets. The inter-rater reliability between the compared methods was also excellent. We used a wide variety of scanning and reconstruction parameters in our training and testing data. Relative errors were higher with softer reconstruction kernels, iterative or DL-based reconstruction algorithms, and higher doses, i.e., in low noise conditions. Consistently, the relative errors were smaller at lower doses and with sharper reconstruction kernels. As the amplitude of noise decreases (even close to zero), the relative errors can be high even if the absolute error (in HU) is not. As shown by the Bland-Altman analyses, this was indeed the case: the absolute SD difference was lower in low noise conditions and, correspondingly, the SD differences became higher and more dispersed in higher noise conditions.
The phantom results also indicated a trend in the SD difference, where voxels closer to the isocenter had higher SD estimates than voxels at the periphery. Thus, the developed noise estimation model seemed less sensitive to the more structured noise at the periphery of the field of view, which may be interpreted as originating from an anatomical structure [31,32].
To further test the generalizability of the model, we collected additional datasets from scanners that were not used for training. We also varied the X-ray spectrum using a 100 kV tube voltage and tin filtering in the parameter combinations. While the errors were slightly higher, the inter-rater agreement in the mean noise SD values remained excellent. While the pixel-wise values vary, the deliverable metric of this method would indeed be the mean of the noise map in the analyzed region, with which the noise magnitude could be estimated. The excellent agreement of the CNN estimates with the gold-standard noise map values indicates that this method could reliably be used to quantify noise magnitude in a quality control setting.
In the clinical test data results, similar HU statistics were observed in the noise SD estimates produced with the reference and CNN methods. For clinical images, a reliable reference estimate is difficult to compute because scan repetitions are not available. Our approach was to use a slice-wise running-average estimate that could capture at least some of the local deviation in the HU values. However, these reference estimates had severely over-enhanced structures arising from tissue boundaries and contrast agent. Because of these imperfections, the reference we used should not be viewed as the true ground truth. The case with the highest deviation from the reference (case #4) indeed had a greater slice thickness (3 mm) than the other cases, and thus severely overestimated boundaries in the reference estimate. One case (case #6) was much noisier than the others, and our method resulted in a slight overestimation compared with the reference. In the other cases, our method was in good agreement with the reference estimate (when the most dominant edge imperfections were removed). While more meticulous analysis is required for reliable statistics, this initial analysis showed that there is a prospect to validate this method in a larger clinical dataset [33]. The advantage of our method is that noise can be assessed locally. Compared with our previous study concerning noise estimation in head CT images [25], a similar magnitude of MAPE values in the phantom dataset and a similar descending trend line in the Bland-Altman plots were observed, and the CNN method showed a correspondingly lower tendency to boundary-induced overestimation than the reference method.
The limitations of our study were mostly related to the use of a single anthropomorphic phantom model as the primary DL training target and to the lack of different scanner models and the corresponding variation in image acquisition. However, as the numerical results given by data measured under different conditions (lower kV, other scanner manufacturers) were very similar to the results given by data measured under the same conditions as the training data, the model seemed robust to varying scanning conditions. Larger and more clearly defined indication-specific groups used as clinical test data would have added versatility and confidence to the clinical performance evaluation.
Overall, our results demonstrate the feasibility of DL-based, automated, and objective measurement of image noise directly from clinical chest CT patient images. The measurement framework also enabled targeting the IQ results to clinically relevant chest anatomy, including specified lung regions, through automated segmentation. This provides a clear benefit when considering the use of our method as part of automated clinical-level IQ monitoring, collecting scanner- and protocol-specific statistics on image quality metrics.

Conclusions
Our CNN model predicted noise SD maps without the need for repeated CT scans. It was successfully trained using a phantom dataset and applied to open-source clinical chest CT data with good results. CNN-based noise SD mapping is a promising tool for objective tissue- and organ-specific CT protocol optimization using large clinical data volumes instead of simplified phantom scans.

Fig. 8.
Fig. 8. Boxplots showing the statistics of the noise standard deviation (SD) in Hounsfield units (HU) in the reference noise map estimates (shaded boxes) and the convolutional neural network (CNN) predictions (white boxes). Lines within the boxes refer to median values. The data in cases 1-5 have been thresholded to remove values higher than the 70th percentile in the reference data to account for the bias arising from the enhanced edges of the airways and vessels (this was not needed in case 6, as no contrast agent was present). Cases #1-6 are presented in Table 2. Noise is analyzed in the five lung lobes: left upper (LU), left lower (LL), right upper (RU), right middle (RM), and right lower (RL).

Fig. 2.
Fig. 2. Schematic of the CNN model used to estimate the noise (standard deviation, SD) maps. The number of filters is shown at the top of each block. The input to the network is an 11 × 11 × 11 patch from the masked reconstruction volume, and the output is a single voxel containing the local noise SD value. The loss is obtained by comparing the ground truth noise value with the output value. Conv3, 3 × 3 × 3 convolution; ReLU, rectified linear unit; BN, batch normalization; Conv1, 1 × 1 × 1 convolution.

Fig. 5.
Fig. 5. Histograms of the ground truth lung standard deviation (SD) maps (blue) and CNN-predicted lung SD maps (orange) measured using the GE Revolution EVO CT scanner. The histogram peak positions matched closely, while the shapes varied. ASIR-V 20, adaptive statistical iterative reconstruction with 20 % weighting; ASIR-V 40, ASIR with 40 % weighting; DLIR, deep learning iterative reconstruction (TrueFidelity); FBP, filtered back projection; SD, standard deviation; HU, Hounsfield unit; CTDIvol, volume computed tomography dose index. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 6.
Fig. 6. Modified Bland-Altman plots between the standard deviation (SD) measurements given by the ground truth SD maps and the estimates given by the convolutional neural network (CNN) (measured as the mean SD value from the map). A. Results for test set 1 (GE scanner). Inter-rater reliability was tested with the intraclass correlation coefficient ICC(3,1), accounting for specific raters, mixed effects, and consistency. ICC(3,1) = 0.999 indicated excellent agreement between the ground truth and CNN measurements. B. Results for test set 2 (Siemens and Canon scanners). ICC(3,1) = 0.997, indicating excellent agreement between the ground truth and CNN measurements.

Fig. 7.
Fig. 7. Example slices from the clinical datasets. The reference method was computed via slice-by-slice running-average mean subtraction followed by standard deviation (SD) calculation. The prediction from the convolutional neural network (CNN) was computed using the model trained with phantom data. Lung lobe segmentations were obtained using an open-source deep learning-based software (example visualized for case #1). Cases #1-6 are presented in Table 2. Windowing: input reconstructions W:1500 L:-600; reference and prediction maps W:200 L:100, except for case #6 W:300 L:150.

Fig. 9.
Fig. 9. Linearity of the model output. When additional noise is imposed as a function of the square root of a weight parameter (w), the variance of the model output (calculated as the mean of the standard deviation (SD) map raised to the power of two) behaves linearly. The dots represent individual measurements, and the dashed line is the linear regression fit.

Table 1
Parameters used in the different phantom dataset scans. The parameters were combined, resulting in a total of 30 variations in the training and validation sets and in test set 1. For test set 2, a total of 14 variations were used.
Fig. 1. Datasets used for this study. The scanning parameters varied between scans and are listed in Table 1. A. Phantom dataset imaged with the GE Revolution EVO CT scanner. Data were split sequentially and slice-wise into training, validation, and test sets (5 slices were excluded on both sides of the test slabs). B. Phantom dataset imaged with Siemens SOMATOM Force and Canon Aquilion Prime CT scanners. Full lung volumes were used for testing. C. Clinical test data were retrieved from the Radiological Society of North America Pulmonary Embolism Detection Challenge open dataset and the COVID-19-CT dataset. Lung volumes were segmented into lobes for analysis.

Table 2
Imaging parameters of the clinical CT studies used in test set 3.