Free automatic software for quality assurance of computed tomography calibration, edges and radiomics metrics reproducibility

Purpose: To develop an easy-to-use, reproducible QA procedure based on open-source code that automatically evaluates the stability of different metrics extracted from CT images: Hounsfield Unit (HU) calibration, edge characterization metrics (contrast and drop range) and radiomic features. Methods: The QA protocol was based on electron density phantom imaging. In-house open-source Python code was developed for the automatic computation of the metrics and their reproducibility analysis. Reproducibility was evaluated for different radiation therapy protocols, phantom positions within the field of view and systems, in terms of variability (Shapiro-Wilk test for 15 repeated measurements carried out over three days) and comparability (Bland-Altman analysis and Wilcoxon Rank Sum Test or Kendall Rank Correlation Coefficient). Results: Regarding intrinsic variability, most metrics followed a normal distribution (88% of HU, 63% of edge parameters and 82% of radiomic features). Regarding comparability, HU and contrast were comparable in all conditions, and drop range only for the same CT scanner and phantom position. The percentages of radiomic features that were comparable independently of protocol, position and system were 59%, 78% and 54%, respectively. The non-significant differences in HU calibration curves obtained for two different institutions (7%) translated into comparable Gamma Index G (1 mm, 1%, > 99%). Conclusions: Automated software to assess the reproducibility of different CT metrics was successfully created and validated. A QA routine is proposed.


Introduction
Computed Tomography (CT) imaging is a consolidated and highly available modality for the diagnosis, staging, treatment, and prevention of multiple diseases [1,2]. Among its possible applications, it is particularly relevant in oncology, where it plays a pivotal role in early cancer diagnosis and in monitoring treatment effects [3,4]. CT images play a crucial role in radiotherapy planning, allowing the delineation of tumors and organs at risk. Moreover, CT is commonly used for dose calculation [5,6], where the dose to the different organs is calculated from the electron density obtained from the CT images.
Technical improvements have made it possible to extract high-throughput quantitative features from images, known as radiomics [7], allowing data mining and analysis. Radiomic features provide information about tumor shape, microarchitecture, and heterogeneity. Radiomics are used to construct either descriptive or predictive models that facilitate clinical decision making. This is particularly relevant in oncology, where radiomics may give important surrogate phenotypic information [8], providing significant data to determine survival and tumor response [9]. Nevertheless, radiomic features have proven to be sensitive to variations in acquisition parameters, signal-to-noise ratios, image processing methods and tumor delineation [9], limiting the generality of prediction models built from radiomic features.
Quality Assurance (QA) protocols are needed to assess the performance of CT systems, the quality of the obtained images, and the robustness and reproducibility of the different features. Periodic QA tests regarding image quality are recommended in the AAPM TG-66 report [10] and by international authorities such as ICRP and IAEA [11]. Moreover, institutions like ICRU and the American College of Radiology (ACR) emphasize the effectiveness of quantitative tools for evaluating phantom images in QA tests [12,13]. Although most CT manufacturers provide their own commercial software for routine QA programs, these tools depend on the vendor and require the use of vendor-specific phantoms. In addition, their implemented metrics are rather elementary and far from being clinically relevant. Moreover, the closed software approach does not allow user interaction, denying the possibility of adding new features more suitable for the user requirements. Consequently, multiple open-source QA programs have been developed, which require the use of specific phantoms [14][15][16]. However, they neither evaluate the reproducibility of the calculated metrics nor integrate the characterization of tissue interface borders or radiomic features.
In this study, an open-source software solution, fully developed in Python, is presented for the automation of QA in CT systems. It automates the QA process by using reference segmentations. The QA procedure calibrates the HU, characterizes tissue interface borders and extracts radiomic features using the PyRadiomics platform [17]. To assess generality, the tool was validated in six CT scanners using two different phantoms. As a proof of concept, the proposed QA software was used to evaluate how reproducibility depends on the acquisition protocol, the position of the phantom and the CT scanner.

Experimental phantoms
Two different experimental phantoms were used: the Electron Density Phantom Model 062M (CIRS) and the Tomotherapy Cheese Phantom (Accuray). Both consist of a cylindrical body with an electron density similar to that of water and a set of holes in which inserts simulating different human tissues are placed.

CT scanners
Six different CT systems from two different institutions were used. From Center 1: the Philips Gemini TF 64 PET/CT from the Nuclear Medicine department (PET/CT-NM); the Philips Gemini TF BigBore CT from the Radiation Oncology department (CT-RT); and, from the Diagnostic Radiology department, the Philips Brilliance iCT 256 (CT-DR-1) and the Toshiba Aquilion 64 CT (CT-DR-2). From Center 2: the Philips Gemini TF PET/CT from the Nuclear Medicine department (PET/CT-NM-F) and the Philips Brilliance 16 CT from the Radiation Oncology department (CT-RT-F).

Protocols
The different protocols evaluated in each PET/CT system can be found in the Supplementary Material (Table S1).

Metrics
The evaluated metrics were divided into three groups.

HU calibration
The software characterizes the electron density from the measured HU. For this purpose, the reference segmentations are placed in the middle of each of the phantom inserts (Fig. 1a), and a calibration curve relating the physical density to the HU is calculated.
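As a minimal sketch of how such a calibration curve could be built, the following Python snippet averages the HU inside a segmentation mask and maps HU to density by piecewise-linear interpolation. The function names, the illustrative insert values and the choice of piecewise-linear interpolation are assumptions made for this example and are not taken from the published code.

```python
import numpy as np

def mean_hu(image, mask):
    """Mean HU of the voxels selected by a boolean segmentation mask."""
    return float(np.asarray(image)[np.asarray(mask, dtype=bool)].mean())

def build_calibration(hu_values, densities):
    """Return a function mapping HU to density (piecewise-linear, an assumption)."""
    order = np.argsort(hu_values)
    hu_sorted = np.asarray(hu_values, dtype=float)[order]
    rho_sorted = np.asarray(densities, dtype=float)[order]

    def hu_to_density(hu):
        # np.interp clamps to the end values outside the calibrated HU range.
        return np.interp(hu, hu_sorted, rho_sorted)

    return hu_to_density

# Hypothetical insert measurements (illustrative values only, not from the study).
hu_measured = [-1000.0, -100.0, 0.0, 200.0, 1200.0]
rho_nominal = [0.0, 0.9, 1.0, 1.1, 1.7]
curve = build_calibration(hu_measured, rho_nominal)
print(curve(100.0))  # density interpolated between the 0 HU and 200 HU inserts
```

Two such curves, for instance from two protocols, can then be evaluated on the same HU grid to obtain the relative differences discussed in the Results.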

Edges characterization
The reference segmentations completely cover the insert up to the edge (Fig. 1b). The software uses contrast and drop range to characterize the edges. The contrast metric evaluates the intensity difference between the insert and the phantom body by calculating an intensity gradient over all the voxels that make up the edge region. Drop range characterizes how steep the intensity drop is at the edge of the insert. For this purpose, the pixel intensities along four directions in the transversal plane (covering the interface between the insert and the phantom body) are computed to create an intensity profile. Then, an interval is defined by taking the pixels whose intensity values lie between 10% and 90% of the maximum intensity value of the intensity profile. Narrower intervals represent better-defined borders and steeper intensity drops.
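The drop range definition above can be illustrated with a short Python sketch. It counts the samples of a one-dimensional intensity profile that fall between 10% and 90% of the profile maximum. The function name, the assumption that the profile has been offset so that all values are non-negative, and the two example profiles are hypothetical and only serve to show the idea.

```python
import numpy as np

def drop_range(profile, low=0.10, high=0.90):
    """Number of profile samples between `low` and `high` fractions of the maximum.

    Assumes a non-negative profile (e.g. offset so that the background is near zero).
    """
    profile = np.asarray(profile, dtype=float)
    peak = profile.max()
    inside = (profile >= low * peak) & (profile <= high * peak)
    return int(inside.sum())

# Hypothetical profiles crossing an insert edge (illustrative values only).
sharp_edge = [100, 100, 100, 50, 0, 0, 0]
blurred_edge = [100, 95, 80, 60, 40, 20, 5, 0]

print(drop_range(sharp_edge), drop_range(blurred_edge))
```

With these example profiles the sharp edge gives the narrower interval, matching the interpretation that narrower intervals correspond to better-defined borders.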

Radiomic features
The reference segmentations defined for the computation of radiomic features were nine inhomogeneous areas with different electron densities and two homogeneous areas in air and water (Fig. 1c). A total of 45 different metrics were calculated, classified as First Order metrics, Gray Level Co-occurrence Matrix (GLCM) metrics, Gray Level Size Zone Matrix (GLSZM) metrics, Gray Level Run Length Matrix (GLRLM) metrics and Neighboring Gray Tone Difference Matrix (NGTDM) metrics.

Automatic quality assurance workflow
The software uses a reference image of the phantom to automate the QA process. The tool resizes and rigidly registers the reference image to the new image, saving the result in a transformation matrix. The transformation matrix is subsequently used to transform the reference segmentations so that they fit the new images, and the transformed segmentations are then used to calculate the metrics (Fig. 2).
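The step of carrying a reference segmentation over to a new image can be sketched as follows. This is a NumPy-only illustration under stated assumptions: the transformation is given as a homogeneous matrix that maps reference voxel indices to new-image voxel indices, and the mask is resampled with nearest-neighbour interpolation. The function name and the matrix convention are hypothetical; the published tool may rely on a dedicated registration library and a different convention.

```python
import numpy as np

def transform_mask(mask, matrix, output_shape):
    """Resample a boolean mask onto a new grid (nearest neighbour).

    `matrix` is assumed to map reference index coordinates to new-image
    index coordinates in homogeneous form, shape (ndim + 1, ndim + 1).
    """
    mask = np.asarray(mask, dtype=bool)
    ndim = mask.ndim
    inverse = np.linalg.inv(matrix)
    # Homogeneous index coordinates of every voxel of the output grid.
    grid = np.indices(output_shape).reshape(ndim, -1)
    homog = np.vstack([grid, np.ones((1, grid.shape[1]))])
    # Pull each output voxel back into the reference grid and round to a voxel.
    source = np.rint((inverse @ homog)[:ndim]).astype(int)
    limits = np.array(mask.shape)[:, None]
    valid = np.all((source >= 0) & (source < limits), axis=0)
    out = np.zeros(grid.shape[1], dtype=bool)
    out[valid] = mask[tuple(source[:, valid])]
    return out.reshape(output_shape)

# Hypothetical 2D example: shift a one-voxel segmentation by (+2, +1).
reference_mask = np.zeros((5, 5), dtype=bool)
reference_mask[1, 1] = True
shift = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0]])
new_mask = transform_mask(reference_mask, shift, (5, 5))
```

Pulling output voxels back through the inverse matrix, rather than pushing reference voxels forward, avoids holes in the resampled mask when the transformation includes scaling.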

Reproducibility analysis

Variability
For the intrinsic variability, the CIRS phantom was imaged with the PET/CT-NM system from Center 1 using the protocol commonly used in clinical practice (protocol C from Table S1). The phantom was imaged 5 times per day on 3 different days. The variability of each metric was studied over the 15 acquisitions by evaluating the goodness of fit of the data to a normal (Gaussian) distribution using the Shapiro-Wilk normality test [18].
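A normality check of this kind can be sketched with `scipy.stats.shapiro`. The significance level of 0.05, the helper name and the simulated measurements are assumptions for illustration; the study does not state these details in this section.

```python
import numpy as np
from scipy import stats

def is_normal(values, alpha=0.05):
    """Shapiro-Wilk test: normality is not rejected when p > alpha.

    alpha = 0.05 is an assumed significance level.
    """
    statistic, p_value = stats.shapiro(values)
    return bool(p_value > alpha), float(p_value)

# Hypothetical repeated measurements of one metric over 15 acquisitions.
rng = np.random.default_rng(0)
hu_repeats = rng.normal(loc=40.0, scale=2.0, size=15)

normal, p_value = is_normal(hu_repeats)
print(f"normality not rejected: {normal} (p = {p_value:.3f})")
```

Applied per metric and per segmentation, such a helper would yield the counts of normally distributed metrics of the kind reported in the Results.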

Comparability
The comparability of the metrics was analyzed in terms of the implemented protocol, the position of the phantom inside the field of view (FoV) and the CT system used. Protocol C from PET/CT-NM (Table S1) was used as the reference protocol. An overview of the assessment of the comparability of the metrics is shown in Fig. 3.
A Bland-Altman (BA) analysis [19] was carried out to assess the comparability of the HU calibration. For edge characterization metrics, the Kendall Rank Correlation Coefficient (KRCC) [20] was used to evaluate the similarity between two ordinal classifications. Finally, for both HU calibration and radiomic features, a Wilcoxon Rank Sum Test (WRST) [21] was also carried out.
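The three comparability tools named above can be sketched in Python with NumPy and SciPy. The helper names, the 1.96 standard deviation limits of agreement, the significance level of 0.05 and the decision rules are assumptions for this illustration and may differ from the criteria used in the published code.

```python
import numpy as np
from scipy import stats

def bland_altman(a, b):
    """Bias and 95% limits of agreement between paired measurements."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias = diff.mean()
    spread = 1.96 * diff.std(ddof=1)
    return float(bias), float(bias - spread), float(bias + spread)

def rank_sum_comparable(a, b, alpha=0.05):
    """Wilcoxon rank-sum test: comparable when no significant difference is found."""
    return bool(stats.ranksums(a, b).pvalue > alpha)

def ranking_comparable(rank_a, rank_b, alpha=0.05):
    """Kendall tau: comparable when the two orderings are significantly concordant."""
    tau, p_value = stats.kendalltau(rank_a, rank_b)
    return bool(tau > 0 and p_value < alpha)

# Hypothetical data (illustrative values only).
hu_reference = [-990.0, -95.0, 2.0, 210.0, 1190.0]
hu_other = [-992.0, -97.0, 0.0, 208.0, 1188.0]
bias, lower, upper = bland_altman(hu_reference, hu_other)

ranking_reference = [1, 2, 3, 4, 5, 6, 7, 8]
ranking_other = [1, 2, 3, 4, 5, 6, 8, 7]
print(bias, ranking_comparable(ranking_reference, ranking_other))
```

Note the opposite logic of the two tests in this sketch: for the rank-sum test a large p-value indicates comparability, whereas for Kendall's tau a small p-value with positive tau indicates that the two classifications agree.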

Gamma index
To assess the effect that the different calibration curves may have on radiotherapy planning, the gamma index [22] was calculated for four different radiotherapy plans in different cancer sites: lung, brain, prostate, and head and neck. The planning results using the calibration curve from the NM department at Center 1 were compared to those obtained with the curve from the RT department at Center 1 and with the curve obtained from Center 2. The dose difference criterion was set to 1% and the distance-to-agreement (DTA) to 1 mm.
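To make the two gamma criteria concrete, the following sketch computes a one-dimensional gamma index between a reference and an evaluated dose profile. It is a simplified illustration, not the clinical three-dimensional computation: the global normalization of the dose criterion to the maximum reference dose, the brute-force search over all points and the example profiles are assumptions.

```python
import numpy as np

def gamma_1d(positions_mm, dose_ref, dose_eval, dta_mm=1.0, dose_pct=1.0):
    """Gamma value at each reference point of a 1D dose profile.

    The dose criterion is a percentage of the maximum reference dose
    (global normalization, an assumption of this sketch).
    """
    x = np.asarray(positions_mm, dtype=float)
    ref = np.asarray(dose_ref, dtype=float)
    ev = np.asarray(dose_eval, dtype=float)
    dose_tol = dose_pct / 100.0 * ref.max()
    # Rows: reference points; columns: evaluated points.
    distance = (x[:, None] - x[None, :]) / dta_mm
    dose_diff = (ev[None, :] - ref[:, None]) / dose_tol
    return np.sqrt(distance ** 2 + dose_diff ** 2).min(axis=1)

def pass_rate(gamma):
    """Percentage of points with gamma <= 1."""
    return float(100.0 * np.mean(np.asarray(gamma) <= 1.0))

# Hypothetical profiles on a 1 mm grid (illustrative values only).
x_mm = np.arange(0.0, 10.0, 1.0)
dose_a = 10.0 * x_mm          # reference plan
dose_b = dose_a + 0.5         # small deviation, within 1% of the maximum dose
print(pass_rate(gamma_1d(x_mm, dose_a, dose_b)))
```

A point passes when some evaluated point lies close enough in the combined distance and dose sense, so a pass rate above 99% means that almost all points satisfy the 1 mm and 1% criteria together.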

Results
Our open-source code for automatic CT QA can be downloaded from https://github.com/juandasm/CT_Metrics_Reproducibility.

Intrinsic variability of the metrics
For the segmentations shown in Fig. 1, the code was used to evaluate whether the values of the HU, edge parameters and radiomic features followed a normal distribution across the 15 acquisitions. All acquisitions were performed with the same protocol (C), the same CT scanner (PET/CT-NM at Center 1) and the same position of the CIRS phantom (center of the FoV). Results are shown in Table 1. In 15 out of the 17 inserts, the HU values followed a normal distribution. For edge parameters, both contrast and drop range followed a normal distribution in 5 out of the 8 inserts. From the contrast and drop range values, a classification of the inserts was obtained for each measurement; the most repeated classifications are presented in Fig. 4. When comparing classifications by KRCC, both contrast and drop range were comparable. Therefore, classification instead of absolute values is employed in the following sections. Finally, of the 45 radiomic features evaluated, 37 showed a normal distribution in at least 8 of the 11 segmentations. Only these 37 radiomic features are evaluated in the following sections.

Comparability of metrics
All results are summarized in Table 2.

Protocol dependency
The impact of the protocol was evaluated with the PET/CT-NM system at the NM department in Center 1, with the CIRS phantom placed at the center of the FoV. The results derived from protocol C were compared with the other three protocols for the PET/CT-NM system (Table S1). For the HU measurements, both the BA analysis and the WRST confirmed that all protocols were comparable. Based on these results, a recommended calibration curve was calculated by averaging across all NM protocols. Calibration curves for each protocol and the recommended calibration curve are shown in Fig. 5. The HU calculated with this recommended calibration curve differed by less than 5% from the HU calculated with the calibration curve obtained for each protocol. In addition, the edge characterization based on the values of contrast and drop range was comparable for all the protocols, according to the KRCC test. Of the 37 radiomic features, 22 (59%) were comparable independently of the protocol.

Position dependency
For the same system and protocol (protocol C from PET/CT-NM at Center 1), different positions of the phantom inside the FoV were evaluated, with the phantom placed centered and off-centered. Neither the position of the phantom nor the ring (inner or outer ring of the CIRS phantom) affected the comparability of the HU or of the classification based on contrast values. However, position within the FoV had a significant effect on drop range, reducing edge sharpness, with an average 12% decay in the metric value when the phantom was placed off-center; the classifications obtained were not comparable based on KRCC, as seen in Table 2. Twenty-nine radiomic features (78%) were comparable independently of the position.

Acquisition system dependency
Measurements made with protocol C from the NM department at Center 1 were compared with those performed with the other CT systems at Center 1. For all systems and protocols, HU and edge contrast classification were comparable. Drop range classification was not comparable with other CT systems. Regarding radiomic features, different results were obtained for QA protocols compared to clinical protocols: 20 radiomic features (54%) were found to be comparable across all clinical protocols, compared to 13 radiomic features when QA protocols were used.

Comparability of calibration curves for CT systems in different institutions
To demonstrate the feasibility of the proposed method for the comparison of CT performance across institutions, the calibration curve averaged over the four protocols of the PET/CT-NM system at Center 1 (recommended calibration curve in Fig. 5) was compared to the calibration curves derived from the CT systems at the NM and RT departments in Center 2. Calibration curves for the two scanners in Center 2 were comparable, based on BA and WRST. Therefore, they were averaged to establish a recommended calibration curve for Center 2. This curve was comparable to the recommended curve at NM in Center 1. However, larger differences were obtained between the calibration curves of the different hospitals (NM (Center 1) vs Center 2), with an average value of −7%, than between the calibration curves within the same institution (NM (Center 1) vs RT (Center 1)), with an average value of −4%, as shown in Table 3.
The dose difference due to the use of different calibration curves was assessed by calculating the gamma index for 4 different radiotherapy plans in different cancer sites: brain, prostate, head and neck, and lung. The results are shown in Table 4. As shown, the effect of the different calibration curves on radiotherapy planning is negligible, with Gamma (1 mm, 1%) > 99%. Nevertheless, differences increased (more points failed) for the calibration curve that showed larger relative differences in HU quantification (Table 3). A Dose-Volume-Histogram for Bronchial Carcinoma obtained with the different calibration curves is shown in Fig. 6.

Table 2
Comparability of calibration, edge characterization and radiomics measurements, taking protocol C from the NM department at Center 1 as the reference protocol, and comparison between the NM and RT departments at Center 2. Bold is used when the test showed that the measurements were not comparable; entries are left in white if they were.

Discussion
This open-source automated QA software for CT images is fully developed in Python, uses PyRadiomics for radiomic feature extraction, and adapts to multiple phantoms. The software can assess the reproducibility of different CT metrics, including HU calibration, edge characterization and radiomic features.
The use of experimental phantoms is recommended by international committees [10,23,24] within QA programs for CT scanners. There is no agreement on which phantom should be the standard. Some of the proposed phantoms are the CatPhan Phantom [11,14,25], the American Association of Physicists in Medicine (AAPM) CT Performance Phantom [26][27][28] and the ACR CT Phantom [13,27,29,30]. Both the ACR CT Phantom and the CatPhan Phantom are equipped with dedicated software for image analysis and have been implemented for QA in radiotherapy [29,31], but additional software is required for radiomic feature extraction. Furthermore, none of them has dedicated areas to simulate the electron density of the different tissues present in human anatomy. Other proposals [14][15][16] may feature an automatic QA process for a specific phantom, but do not adapt to other phantoms. The presented QA software is suitable for most phantoms, as it only requires reference segmentations and a few adaptations of the software to work with a new phantom. Moreover, in contrast to other freely available QA software, the rigid transformation implemented within our code allows the automation of the QA process. Therefore, once there are new images to be analyzed, no additional work is needed other than adding them to the database.
As a proof of concept, the proposed software has been tested with multiple images acquired from different CT systems and two different phantoms. No significant difference was observed between the different protocols regarding HU calibration, in line with previous reports that changes in neither slice thickness [32] nor mAs [33] affect the measured HU. HU were comparable in different positions of the phantom, although moving the phantom away from the center of the FoV led to slightly increased measurements [34]. In the comparison between Center 1 and Center 2, curves were comparable, although with a greater difference between Center 2 and the NM department at Center 1 than between RT and NM at Center 1. It must be remarked that, because the phantoms employed differed between institutions, different density inserts were evaluated, which could have contributed to the differences observed in the calibration curves. Moreover, based on the results of the Gamma analysis, it could be concluded that calibration curve comparison by WRST and BA is a reasonable criterion for ensuring dose computation comparability. Regarding edge characterization metrics, both are pertinent, as they have been used in the literature to study edges [35]. However, in our study drop range generally showed poor comparability. Some studies emphasize the importance of image smoothing before edge detection, as edge metrics are very sensitive to noise [36]. This could have affected the reproducibility of the edge metrics, since no pre-processing was applied to the images.

Table 3
Percentage deviation in electron density derived from the recommended curve of the PET/CT-NM system at Center 1 with respect to the values derived from the calibration curve for the RT system at Center 1 and from the recommended calibration curve (averaged across NM and RT) at Center 2.

Previous studies have evaluated radiomic feature variability in PET [37][38][39] and MR [40,41], and voxel size has been observed to have a significant effect [37], which was also observed in this study.
Different QA routines for the evaluation of CT system performance can be recommended. Firstly, a QA is recommended when developing radiomic models with multicenter cohorts. By applying the software, the radiomic features that are comparable among all the CT systems can be identified. This allows only the most robust metrics to be included in radiomic models, so that radiomic variation due to the equipment is not mistaken for variation related to the patient; in multiple studies, the CT scanner was found to be a disruptive parameter for radiomics robustness [42]. Table S2 in the Supplementary Material lists the radiomic features found to be comparable among all CT scanners in this study. Secondly, a consistency test is proposed for QA of CT systems. In this case, 3 images should be taken for every clinical protocol. HU are expected to be comparable to the last measurements based on the WRST, with calibration curves showing a relative difference lower than 5%. Regarding edge characterization, contrast classification should be comparable to the last measurements based on the KRCC test. If no differences in HU and edge contrast classification are observed, those radiomic features previously determined to be robust are expected to remain so. We recommend performing the consistency test once per year for CT systems employed in diagnosis and every 6 months for CT systems involved in the RT workflow, since in RT accurate HU quantification is needed to correctly compute the dose delivered to the patient, and well-defined edges play an important role in the accurate and precise contouring of tumors and their surrounding organs at risk.
As a limitation of our study, no pre-processing was applied before metrics calculation, and it may be of interest to determine the effect that image pre-processing could have on metrics reproducibility. Moreover, the CIRS phantom was not stored in facilities with controlled humidity and temperature, which could lead to a small absorption of water. However, when the phantom was not being used for measurements, it was kept in an insulated case provided by CIRS, so we expect the effect of humidity to be small. Furthermore, the CIRS phantom has two different rings, and the inner ring moves freely with respect to the outer one; if the inner ring is not fixed in place, a rigid transformation might not be able to successfully register the images, and additional dedicated segmentations would be necessary.

Conclusions
Open-source software for the automatic analysis of CT metrics and their reproducibility has been developed. It can be adapted to multiple phantoms. The viability of the project has been tested with six different CT systems, two phantoms and two positions within the FoV, analyzing the metrics of the acquired images. Based on the results obtained when assessing the reproducibility of the metrics, different QA routines have been proposed.

Funding
Montserrat Carles was funded by the Conselleria de Sanitat Universal i Salut Pública from the Comunitat Valenciana. The funding sources had no involvement in the writing of the manuscript or in the decision to submit the article for publication.

Fig. 3. Diagram showing the comparability analysis divided in dependency on protocol, position, and CT system.

Fig. 4. Axial view of the CIRS phantom. The most repeated classification for contrast is shown in black. The most repeated classification for drop range is shown in white.

Fig. 5. Calibration curves for all measured protocols from NM at Center 1, recommended calibration curve and measured HU.

Fig. 6. Dose-Volume-Histogram for Bronchial Carcinoma for the same radiotherapy planning and CT image, but with different calibration curves (NM at Center 1, RT at Center 1, and Center 2).

Table 1
Mean value and standard deviation of HU, contrast and drop range for each tissue density. Values are in bold if the measurements did not fit a normal distribution in all the inserts and in white if they did. Measurements were carried out with PET/CT-NM from Center 1 using protocol C.

Table 4
Gamma results comparing the radiotherapy planning computed with the calibration curve from NM at Center 1 to the planning computed with the RT department curve and with the Center 2 curve. Results Gamma represents the percentage of voxels that were comparable, with the number of voxels that failed in parentheses. In Max Gamma, the value in parentheses represents the effect on dose of the different calibration curve.