
Machine learning framework for automatic image quality evaluation involving a mammographic American College of Radiology phantom

  • Pei-Shan Ho
    Affiliations
    Department of Engineering and System Science, National Tsing Hua University, No. 101, Section 2, Kuang-Fu Road, Hsinchu 30013, Taiwan
  • Yi-Shuan Hwang
    Affiliations
    Department of Medical Imaging and Intervention, New Taipei City Municipal TuCheng Hospital, New Taipei City 236, Taiwan

    Department of Medical Imaging & Radiological Sciences, Chang Gung University, No. 259 Wen-Hwa 1st Road, Kwei-Shan, Taoyuan 333, Taiwan
  • Hui-Yu Tsai
    Correspondence
    Corresponding author at: Institute of Nuclear Engineering and Science, National Tsing Hua University, No. 101, Section 2, Kuang-Fu Road, Hsinchu 30013, Taiwan.
    Affiliations
    Department of Engineering and System Science, National Tsing Hua University, No. 101, Section 2, Kuang-Fu Road, Hsinchu 30013, Taiwan

    Institute of Nuclear Engineering and Science, National Tsing Hua University, No. 101, Section 2, Kuang-Fu Road, Hsinchu 30013, Taiwan
Published: August 26, 2022
DOI: https://doi.org/10.1016/j.ejmp.2022.08.004

      Highlights

      • To establish an automatic image quality evaluation framework using machine learning.
      • The support vector machine-based framework can accurately predict human evaluation.
      • The minimal effective dataset size and the effect of dataset diversity are determined for our model.
      • The influential features for this framework are determined and discussed.

      Abstract

      Purpose

      The image quality (IQ) of mammographic images is essential when making a diagnosis, but the quality assurance process for radiological equipment is subjective. We therefore aimed to design an automatic IQ evaluation architecture based on a support vector machine (SVM) dedicated to evaluating images of the mammographic American College of Radiology (ACR) phantom.

      Methods

      A total of 461 phantom images were acquired using mammographic equipment from 10 vendors. Two experienced medical physicists scored the images by consensus. The phantom datasets were randomly divided into training (80%) and testing (20%) sets. Each phantom image (with 6 fibers, 5 specks, and 5 masses) was detected by using bounding boxes, then cropped and divided into 16 pattern images. We identified 159 features for each pattern image. Manual scores were used to assign 3 labels (visible, invisible, and semivisible) to each pattern image. Multiclass-SVM models were trained with 3 types of patterns. Sub-datasets were randomly selected at 10% increments of the total dataset to determine a minimal effective training subset size for the automatic framework. A feature combination test and an analysis of variance were performed to identify the most influential features.

      Results

      The accuracy of the model in evaluating fiber, speck, and mass patterns was 90.2%, 98.2%, and 88.9%, respectively. The performance was equivalent when the sample size was at least 138 (30% of 461) phantom images. The most influential feature was the position feature.

      Conclusions

      The proposed SVM-based automatic IQ evaluation framework applied to a mammographic ACR phantom accurately matched manual evaluations.

      Introduction

      Mammography is an effective tool for detecting breast cancer in its early stages [Tang J., Liu X. Classification of Breast Mass in Mammography with an Improved Level Set Segmentation by Combining Morphological Features and Texture Features]. High image quality (IQ) is essential for correctly detecting abnormalities [Samei E. Medical physics 3.0: A renewed model for practicing medical physics in clinical imaging]. In the quality assurance process recommended by the American College of Radiology (ACR), the IQ of phantom images is scored manually by experienced medical physicists. The main disadvantages of manual evaluation are its subjectivity and nonrepeatability [Marshall N.W. A comparison between objective and subjective image quality measurements for a full field digital mammography system; Lee J., Nishikawa R.M., Reiser I., Zuley M.L., Boone J.M. Lack of agreement between radiologists: implications for image-based model observers; Manco L., Maffei N., Strolin S., Vichi S., Bottazzi L., Strigari L. Basic of machine learning and deep learning in imaging for medical physicists]. Therefore, in response to demands to ensure high-quality image evaluation, IQ evaluation has frequently been automated [Ramos JE, Kim HY, Tancredi F. Automation of the ACR MRI Low-Contrast Resolution Test Using Machine Learning. 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). Beijing, China: IEEE; 2018. p. 1-6; Gennaro G., Ballaminut A., Contento G. A multiparametric automatic method to monitor long-term reproducibility in digital mammography: results from a regional screening programme; Sun J., Barnes M., Dowling J., Menk F., Stanwell P., Greer P.B. An open source automatic quality assurance (OSAQA) tool for the ACR MRI phantom; Sousa M., Siqueira P., Medeiros R., Schiabel H. Automatic evaluation of quality parameters in digital mammography images using the Phantom CDMAM]. The development of effective automatic IQ evaluation frameworks is a key step for applying this technology in practice.
      Several attempts have been made to automate IQ evaluation. One such approach is pattern recognition. It involves using intensity contrast and morphological detection, and it is the most intuitive method of detecting an object in phantom images in low-contrast resolution tests [Tsai M.-H., Chung C.-T., Wang C.-W., Chan Y.-K., Shen C.-C. An automatic contrast-detail phantom image quality figure evaluator in digital radiography; Ehman M.O., Bao Z., Stiving S.O., Kasam M., Lanners D., Peterson T., et al. Automated low-contrast pattern recognition algorithm for magnetic resonance image quality assessment]. Another approach is to obtain IQ parameters automatically using morphological detection and signal transformation [Sun J., Barnes M., Dowling J., Menk F., Stanwell P., Greer P.B. An open source automatic quality assurance (OSAQA) tool for the ACR MRI phantom; Davids M., Zollner F.G., Ruttorf M., Nees F., Flor H., Schumann G., et al. Fully-automated quality assurance in multi-center studies using MRI phantom measurements; Panych L.P., Chiou J.Y., Qin L., Kimbrell V.L., Bussolari L., Mulkern R.V. On replacing the manual measurement of ACR phantom images performed by MRI technologists with an automated measurement approach; Cohen M.W., Kennedy J.A., Pirmisashvili A., Orlikov G. An automatic system for analyzing phantom images to determine the reliability of PET/SPECT cameras; Alvarez M., Pina D.R., Miranda J.R., Duarte S.B. Application of wavelets to the evaluation of phantom images for mammography quality control]. However, such linear approaches are unable to match the complexity of perception and image processing capability of a human observer. Recently, widely used state-of-the-art machine learning methods such as random forest and support vector machine (SVM) [Cortes C., Vapnik V. Support-vector networks] have been applied to the automation of IQ evaluations [Ramos JE, Kim HY, Tancredi F. Automation of the ACR MRI Low-Contrast Resolution Test Using Machine Learning. 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). Beijing, China: IEEE; 2018. p. 1-6; Sousa M., Siqueira P., Medeiros R., Schiabel H. Automatic evaluation of quality parameters in digital mammography images using the Phantom CDMAM; Pizarro R.A., Cheng X., Barnett A., Lemaitre H., Verchinski B.A., Goldman A.L., et al. Automated quality assessment of structural magnetic resonance brain images based on a supervised machine learning algorithm]. Deep learning algorithms such as convolutional neural networks have also been used to detect lesions in liver lesion phantom analyses [Kopp F.K., Catalano M., Pfeiffer D., Fingerle A.A., Rummeny E.J., Noel P.B. CNN as model observer in a liver lesion detection task for x-ray computed tomography: A phantom study]. However, the performance of automatic IQ evaluation frameworks is still unsatisfactory for practical adoption.
      In machine learning investigations, the prepared dataset should be both large and diverse [Willemink M.J., Koszek W.A., Hardell C., Wu J., Fleischmann D., Harvey H., et al. Preparing Medical Imaging Data for Machine Learning]. However, medical images are often expensive and difficult to obtain. Finding the smallest effective sample size is a classical problem in medicine-related machine learning studies [Castiglioni I., Rundo L., Codari M., Di Leo G., Salvatore C., Interlenghi M., et al. AI applications to medical images: From machine learning to deep learning]. A model’s complexity is a crucial determinant of the minimum sample size required [Raudys S.J., Jain A.K. Small sample size effects in statistical pattern recognition: Recommendations for practitioners]. Generally, nonlinear models require a larger sample size than do linear models, deep learning models are more complex than traditional classifiers are, and a larger feature size (ie, higher dimensionality) corresponds to higher model complexity. Appropriate principal component selection can reduce the dimensionality of a model. Another approach to reducing model complexity is to select appropriate features [Castiglioni I., Rundo L., Codari M., Di Leo G., Salvatore C., Interlenghi M., et al. AI applications to medical images: From machine learning to deep learning; Bolón-Canedo V., Remeseiro B. Feature selection in image analysis: a survey]. Therefore, an analysis was conducted to determine the smallest sample size required for our framework based on the traditional nonlinear classifier of an SVM. Training with a diverse dataset improves a model’s generalizability and therefore its robustness. The effects of heterogeneity in the composition of the training and the testing set are also discussed in this paper.
      The objective of this study was to develop an SVM-based automatic IQ evaluation framework. The key parameters of machine learning algorithms—principal component selection, feature selection, sample size, and data diversity—were carefully considered and discussed.

      Materials and methods

      Phantom

      An ACR-approved CIRS Model 015 Mammographic Accreditation Phantom (CIRS Inc., Norfolk, VA, USA), which meets the requirements of the Mammography Quality Standards Act [Hendrick R.E., Bassett L.W., Botsco M.A., Deihel D., Feig S., Gray J., et al. Mammography Quality Control Manual], was scanned using full-field digital mammography (FFDM) systems to create the image dataset employed in this study. The phantom comprises various structures simulating fibers, specks, and masses in the human breast. A total of 16 patterns (6 fibers, 5 specks, and 5 masses) were embedded in wax, and the wax was inserted in the polymethylmethacrylate-based phantom, as shown in Fig. 1(a).
      Fig. 1. The ACR phantom image was acquired through mammography following the ACR-recommended test procedures. The phantom contains 6 fibers of different thicknesses, 5 speck groups, and 5 masses of different diameters. The left figure shows a typical image that was automatically cropped from the background. The right figure shows the 16 pattern images generated by dividing the phantom image.

      Dataset

      Images of the phantom were acquired between January 2017 and August 2019 by using mammographic systems from 10 vendors. Acquisition was performed in accordance with the procedure of the ACR quality control manual for FFDM systems [Berns E.A., Pfeiffer D.E., Butler P.F., Adent C., Baker J.A., Bassett L.W., et al. Digital Mammography Quality Control Manual]. The details of the machines are in Table 1. A total of 461 phantom images were included in the dataset. Each phantom image was then segmented into 16 pattern images, as shown in Fig. 1(b), for a total of 7376 pattern images in the dataset.
      Table 1. Dataset profiles of imaging system manufacturers along with corresponding system types and image bit depths.

      Manufacturer    | Frequency | System type | Detector type      | Bit depth (bits)
      Hologic         | 147       | DR          | a-Se (direct)      | 12–14
      GE              | 103       | DR          | a-Si (indirect)    | 12–14
      Siemens         | 61        | DR          | a-Se (direct)      | 12–14
      IMS             | 28        | DR          | a-Se (direct)      | 13
      Philips         | 6         | DR          | Photon counting    | 16
      Metaltronica    | 2         | DR          | a-Se (direct)      | 14
      Fujifilm        | 10        | DR          | a-Se (direct)      | 10–12
      Fujifilm        | 56        | CR          | CR plates          | 10–12
      Konica Minolta  | 27        | CR          | CR plates          | 12
      Kodak           | 19        | CR          | CR plates          | 12
      Agfa Healthcare | 2         | CR          | CR plates          | 12
      Total           | 461       |             |                    |

      CR, computed radiography; DR, digital radiography.
      To identify the smallest sample size required for this model, we randomly selected sub-datasets in increments of 10 % of the size of the total dataset. Each sub-dataset was used to train and test the model, and the accuracy and coefficient of variation (COV) were then analyzed. To test the effects of dataset diversity, we selected the vendor with the largest number of images (Hologic, n = 147) and randomly drew from the full dataset a mixed-vendor sub-dataset of the same size. Three training and testing conditions were implemented: (1) training on the Hologic dataset and testing on the Hologic dataset; (2) training on the Hologic dataset and testing on the sub-dataset; and (3) training on the sub-dataset and testing on the sub-dataset.
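      The sub-dataset sizes and the COV can be illustrated with a short sketch. It is not the authors' code: the rounding rule, the use of the sample standard deviation, and the function names are all assumptions made for illustration.

```python
import random
import statistics

TOTAL_IMAGES = 461  # phantom images in the full dataset (from the text)


def subset_sizes(total, step_percent=10):
    """Map each percentage (10, 20, ..., 100) to a number of images.

    Rounding to the nearest integer is an assumption; the paper does not
    state how fractional image counts were handled.
    """
    return {p: round(total * p / 100) for p in range(step_percent, 101, step_percent)}


def draw_subset(image_ids, size, seed):
    """Randomly pick `size` image IDs without replacement."""
    return random.Random(seed).sample(image_ids, size)


def coefficient_of_variation(accuracies):
    """Sample standard deviation divided by the mean (assumed definition)."""
    return statistics.stdev(accuracies) / statistics.mean(accuracies)


sizes = subset_sizes(TOTAL_IMAGES)
subset = draw_subset(list(range(TOTAL_IMAGES)), sizes[30], seed=0)
print(sizes[30], len(subset))
```

      With this rounding, the 30 % step corresponds to 138 images, which matches the sample size quoted in the abstract.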

      Labeling

      The IQ scores were labeled through consensus by 2 medical physicists (H.Y. Tsai [13 years of experience] and Y.H. Hwang [15 years of experience]). The phantom was scored as detailed in the ACR quality control manual [Berns E.A., Pfeiffer D.E., Butler P.F., Adent C., Baker J.A., Bassett L.W., et al. Digital Mammography Quality Control Manual]. The readers evaluated the images displayed on a JVC ME315L monitor (3 MP, 1021 shades of gray, and contrast ratio of 900:1) in a low-illumination room (20–45 lx). The window width and window level settings were freely adjustable on the acquisition workstation display monitor to optimize visualization. Visible objects were counted from the largest until a score of 0 or 0.5 was reached. Possible scores ranged from 0 to 6 for fiber and 0 to 5 for speck and mass. To enable comparison of the algorithm and manual scores, two scorers labeled 47 phantom images (10 % of all data) individually, and the interscorer variability was calculated. After the phantom images were converted to pattern images, based on their scores, they were assigned to 3 categories: visible, semivisible, and invisible. For example, a phantom image with a score of 3.5 for fiber would be assigned a label of [1,1,1,0.5,0,0] based on the scoring rules listed in the ACR quality control manual [Berns E.A., Pfeiffer D.E., Butler P.F., Adent C., Baker J.A., Bassett L.W., et al. Digital Mammography Quality Control Manual]; 1 indicates a visible fiber, 0.5 indicates a semivisible fiber, and 0 indicates that a fiber was not identified.
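      The conversion from a score to per-pattern labels can be sketched as follows. This is a simplified illustration that reproduces only the worked example in the text; the function name is hypothetical, and any further scoring rules in the ACR manual (such as deductions) are not modeled.

```python
def score_to_labels(score, n_objects):
    """Expand a phantom score into one label per object, largest object first.

    1 = visible, 0.5 = semivisible, 0 = invisible.
    """
    if not 0 <= score <= n_objects or (score * 2) != int(score * 2):
        raise ValueError("score must be a multiple of 0.5 between 0 and n_objects")
    n_visible = int(score)
    labels = [1.0] * n_visible
    if score - n_visible == 0.5:
        labels.append(0.5)  # the object at which counting stopped
    labels += [0.0] * (n_objects - len(labels))
    return labels


print(score_to_labels(3.5, 6))  # [1.0, 1.0, 1.0, 0.5, 0.0, 0.0]
```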

      Image preprocessing

      The ACR phantom images were acquired as DICOM (Digital Imaging and Communications in Medicine) files. The DICOM images were input to in-house MATLAB (MathWorks, Natick, MA, USA) programs and were automatically cropped and split into a grid of 4 × 4 images.
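      The grid split can be sketched in a few lines. The authors used in-house MATLAB programs; the Python version below is only an illustration, and it assumes the image is already cropped, is given as a list of pixel rows, and is tiled in row-major order (the paper does not state the tile ordering).

```python
def split_into_grid(image, rows=4, cols=4):
    """Split a 2-D image (list of pixel rows) into rows * cols tiles.

    Tiles are returned in row-major order. Pixels that do not fit an even
    split at the right and bottom edges are dropped (an assumption).
    """
    height, width = len(image), len(image[0])
    tile_h, tile_w = height // rows, width // cols
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tile = [
                row[c * tile_w:(c + 1) * tile_w]
                for row in image[r * tile_h:(r + 1) * tile_h]
            ]
            tiles.append(tile)
    return tiles


# Toy 8 x 8 "image" whose pixel value encodes its position.
toy_image = [[r * 8 + c for c in range(8)] for r in range(8)]
tiles = split_into_grid(toy_image)
print(len(tiles))  # 16 pattern images
```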

      Feature extraction

      A total of 159 features were extracted from each pattern image for the machine learning algorithm input. These features were associated with position, global, local, edge, and texture information (Table 2). The position feature indicates the pattern location in the phantom (values from 1 to 16). The global features represented the mean and standard deviation of gray level, the matrix size, and the gradients of each phantom image. The local features included the mean and standard deviation of gray levels inside the signal region of interest (ROI) and background, the pixel size and edges of the signal ROI, the contrast between the signal ROI and background, the contrast-to-noise ratio, the gradients, and the texture information in each pattern image. The signal ROI was automatically detected by an in-house algorithm programmed in MATLAB.
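      A few of the local features can be illustrated with a minimal sketch. The paper does not give the formulas, so the definitions of contrast and contrast-to-noise ratio used here are common conventions adopted as assumptions, and the function and key names are hypothetical.

```python
import statistics


def roi_statistics(signal_pixels, background_pixels):
    """Compute simple local features from signal-ROI and background pixels.

    Assumed definitions (not taken from the paper):
      contrast = (mean_in - mean_out) / mean_out
      cnr      = (mean_in - mean_out) / noise_out
    Noise is taken as the population standard deviation of the gray levels.
    """
    mean_in = statistics.mean(signal_pixels)
    mean_out = statistics.mean(background_pixels)
    noise_in = statistics.pstdev(signal_pixels)
    noise_out = statistics.pstdev(background_pixels)
    return {
        "mean_in": mean_in,
        "noise_in": noise_in,
        "mean_out": mean_out,
        "noise_out": noise_out,
        "contrast": (mean_in - mean_out) / mean_out,
        "cnr": (mean_in - mean_out) / noise_out,
    }


features = roi_statistics([110, 110, 110, 110], [98, 102, 98, 102])
print(features["contrast"], features["cnr"])
```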
      Table 2. Features identified and their categories.

      No. | Category        | Feature                      | Reference     | #
      1   | Position        | Image position index         | Samei         | 1
      2   | Global          | Global signal                | Berns et al.  | 1
          |                 | Global noise                 | Berns et al.  | 1
          |                 | Global area *                |               | 1
      3   | Global gradient | Global gradient              | Berns et al.  | 36
      4   | Local           | Mean inside the signal ROI   | Samei         | 1
          |                 | Noise inside the signal ROI  | Samei         | 1
          |                 | Mean outside the signal ROI  | Samei         | 1
          |                 | Noise outside the signal ROI | Samei         | 1
          |                 | Pixels of the signal ROI *   |               | 1
          |                 | Local contrast               | Berns et al.  | 1
          |                 | Contrast to noise ratio      | Berns et al.  | 1
      5   | Edge            | Edge pixel size *            |               | 16
      6   | Local gradient  | Local gradient               | Berns et al.  | 36
      7   | Texture         | Entropy                      | Berns et al.  | 1
          |                 | Local binary pattern         | Åslund        | 59
          | Total           |                              |               | 159

      # The number of features extracted from each image.
      * Self-defined features.
      References: Samei E. Medical physics 3.0: A renewed model for practicing medical physics in clinical imaging; Berns E.A., Pfeiffer D.E., Butler P.F., Adent C., Baker J.A., Bassett L.W., et al. Digital Mammography Quality Control Manual; Åslund M. Digital Mammography with a Photon Counting Detector in a Scanned Multislit Geometry.

      SVM algorithm

      A multiclass SVM algorithm using a one-versus-one strategy was implemented to classify the patterns based on 3 labels. Sequential minimal optimization was used to determine the optimal parameters for the classifier model. We performed training and testing with 10-fold cross validation for each condition to avoid overfitting. Fig. 2 summarizes the automatic IQ evaluation framework flowchart.
      Fig. 2. Flowchart of the automatic image quality evaluation framework. The dataset included phantom images and manual scores. In total, 159 features were extracted from each of the 16 pattern images segmented from the mammographic phantom images. Pattern labels were assigned using manual scores. The features and pattern labels were used to train the SVM model, and then the model was tested on the testing set.
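      The one-versus-one strategy mentioned above can be sketched independently of the SVM training itself. With 3 labels there are 3 pairwise classifiers, and the final label is chosen by majority vote. In the sketch below the pairwise classifiers are hypothetical threshold rules on a single made-up feature, standing in for trained binary SVMs, and the tie-breaking rule is an assumption.

```python
from collections import Counter
from itertools import combinations

LABELS = ("invisible", "semivisible", "visible")


def one_vs_one_predict(sample, pairwise_classifiers):
    """Return the label with the most votes across all pairwise classifiers.

    Ties are broken in favor of the label listed first in LABELS (assumption).
    """
    votes = Counter(clf(sample) for clf in pairwise_classifiers.values())
    return max(LABELS, key=lambda label: (votes[label], -LABELS.index(label)))


def make_threshold_classifier(low_label, high_label, threshold):
    """Toy stand-in for a trained binary SVM: thresholds one scalar feature."""
    return lambda x: low_label if x < threshold else high_label


# One classifier per unordered label pair: 3 pairs for 3 labels.
thresholds = {
    ("invisible", "semivisible"): 1.0,
    ("invisible", "visible"): 2.0,
    ("semivisible", "visible"): 3.0,
}
classifiers = {
    pair: make_threshold_classifier(pair[0], pair[1], thresholds[pair])
    for pair in combinations(LABELS, 2)
}

for value in (0.5, 2.5, 4.0):
    print(value, one_vs_one_predict(value, classifiers))
```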

      Principal component analysis

      To avoid overfitting and redundancy, principal component analysis (PCA) was performed to reduce the number of model dimensions. We used 2 methods to determine the appropriate number of principal components (R). After grid searching of the models using the first n principal components from 1 to 100, the best accuracy in each pattern was obtained when R_best principal components were selected. We also identified the R_exp principal components that explained 95 % of the variance. The performance levels of models trained with and without principal components selected were assessed for each type of pattern.
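      The second criterion, the number of components that explain 95 % of the variance, can be sketched as a cumulative sum over the PCA eigenvalues. The function name and the example eigenvalues are illustrative assumptions, not values from the study.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading components whose eigenvalues sum to at
    least `target` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)


# Made-up eigenvalues: cumulative fractions are 0.60, 0.90, 0.98, 1.00.
print(components_for_variance([6.0, 3.0, 0.8, 0.2]))  # 3
```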

      Influential feature assessment

      A preliminary quantitative evaluation of the correlations between the standard IQ parameters was performed. We calculated the coefficients of correlation between the mean values inside the ROI, the noise inside the ROI, the mean values outside the ROI, the noise outside the ROI, the local contrast, the contrast to noise ratio (CNR), and the labels.
      Two approaches were used to discover influential features to increase computational efficiency and reduce redundancy. In the first approach, 7 categories of features were combined, resulting in a total of 127 ($C_1^7 + C_2^7 + \cdots + C_7^7 = 2^7 - 1$) feature combinations. The model was trained and tested on all the combinations to assess the influence of the features. In the other approach, we located the most discriminative features by assigning feature values into 3 groups according to the label of the pattern (visible, semivisible, and invisible) and then testing the equality of the sample means (ie, null hypothesis testing). The discriminative features were identified when the means of the 3 groups were unequal. The model’s performance using discriminative features was assessed and compared in each pattern.
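      The count of 127 combinations can be checked by enumerating every non-empty subset of the 7 feature categories. The category names follow Table 2; the function name is hypothetical.

```python
from itertools import combinations

CATEGORIES = (
    "position", "global", "global gradient", "local",
    "edge", "local gradient", "texture",
)


def all_category_combinations(categories):
    """Every non-empty subset of the categories, smallest subsets first."""
    return [
        combo
        for size in range(1, len(categories) + 1)
        for combo in combinations(categories, size)
    ]


combos = all_category_combinations(CATEGORIES)
with_position = [combo for combo in combos if "position" in combo]
print(len(combos), len(with_position))  # 127 64
```

      Of the 127 combinations, 64 include the position category, so a comparison of combinations with and without it splits the set almost evenly.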

      Statistical analysis

      The agreement between manual and predictive labels was estimated using a weighted Cohen’s kappa. Kappa values ≤ 0, 0.01–0.20, 0.21–0.40, 0.41–0.60, 0.61–0.80, and 0.81–1.00 indicated no, none to slight, fair, moderate, substantial, and almost perfect agreement, respectively. The normality of the distribution was assessed using the Shapiro–Wilk test. Because the accuracies of some models were not normally distributed, the accuracies of different models were tested using the Mann–Whitney U test with a significance level of 0.001.
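      A weighted Cohen's kappa for the 3 ordinal labels can be computed as sketched below. The paper does not state the weighting scheme, so linear weights are an assumption here; the interpretation bands follow the thresholds listed above.

```python
def weighted_kappa(rater_a, rater_b, categories):
    """Linearly weighted Cohen's kappa for two equal-length label lists.

    `categories` lists the ordinal labels from lowest to highest.
    """
    index = {c: i for i, c in enumerate(categories)}
    k, n = len(categories), len(rater_a)
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    disagree_obs = disagree_exp = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # linear disagreement weight
            disagree_obs += weight * observed[i][j]
            disagree_exp += weight * row[i] * col[j]
    if disagree_exp == 0:
        return 1.0  # both raters used a single, identical category
    return 1.0 - disagree_obs / disagree_exp


def interpret_kappa(kappa):
    """Agreement bands as listed in the text."""
    if kappa <= 0:
        return "no agreement"
    for upper, name in ((0.20, "none to slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return name
    return "almost perfect"


manual = [1, 1, 0.5, 0, 0, 1]
predicted = [1, 1, 0.5, 0, 0, 1]
print(weighted_kappa(manual, predicted, (0, 0.5, 1)), interpret_kappa(0.85))
```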
      A preliminary quantitative evaluation of the standard IQ parameters was performed by calculating Pearson’s correlation coefficient between each pair of parameters.
      In the exploration of discriminative features, the mean equality between 3 groups was estimated through an analysis of variance (ANOVA) test with a significance level of 0.05 for each feature. All statistical analyses were implemented using SPSS 21 (SPSS Inc., Chicago, IL, USA).
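      The per-feature ANOVA can be illustrated by computing the one-way F statistic for the 3 label groups. The authors used SPSS; this sketch only shows the statistic itself, with made-up feature values, and leaves out the p-value, which requires the F distribution.

```python
from statistics import mean


def one_way_anova_f(groups):
    """One-way ANOVA F statistic: between-group mean square divided by
    within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Made-up values of one feature, grouped by label.
invisible = [1.0, 2.0, 3.0]
semivisible = [2.0, 3.0, 4.0]
visible = [5.0, 6.0, 7.0]
print(one_way_anova_f([invisible, semivisible, visible]))  # 13.0
```

      A large F value relative to the critical value at the chosen significance level indicates that the group means differ, which is the criterion the text uses to call a feature discriminative.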

      Results

      In the automatic IQ framework, the full dataset was trained and tested in 3 types of patterns. The obtained accuracies were 90.2 %, 98.2 %, and 88.9 % for fiber, speck, and mass patterns, respectively. The weighted Cohen’s kappa coefficients were 0.85, 0.97, and 0.83 for fiber, speck, and mass patterns, indicating almost perfect agreement between the manual and predicted labels. However, agreement was lower (kappa less than 0.8) in the interscorer variability assessment. Moreover, kappa values for the algorithm were always 1 because the algorithm output the same results on the same testing data. The variability between the two scorers was greater than that between the consensus and the algorithm. When the labels were converted back to scores, the averaged predictive scores were comparable to the manual scores in the testing set. Predictive and manual scores matched exactly for n = 37 fiber, n = 84 speck, and n = 41 mass scores. They deviated by 0.5 points for n = 45 fiber, n = 8 speck, and n = 43 mass scores, and by 1 or more points for only a few (n = 10 for fiber, n = 0 for speck, and n = 8 for mass). No difference was observed between predictive and manual scores based on different vendor images (Fig. 3), except for those from Philips and Metaltronica. These inconsistencies may have been caused by the particular imaging technique used by Philips and by an insufficient testing sample size (only one) for Metaltronica.
      Fig. 3. Manual and predictive scores based on different vendor systems in fiber, speck, and mass patterns. The size of the testing set for each vendor is represented as n. The averaged predictive scores (gray lines) were not significantly different from averaged manual scores (dotted lines).
      Fig. 4 shows the accuracies and run time with and without PCA. The accuracies remained the same in all patterns for both principal components with the best accuracy and with 95 % explained variance. However, the training time was reduced substantially after the selection of principal components. In particular, selecting principal components with the best accuracy for the speck pattern resulted in the most efficient training.
      Fig. 4. Accuracies of fiber, speck, and mass pattern recognition with and without PCA (marks) and their corresponding run times (bar). Selecting principal components (PCs) with best accuracy or with 95% explained variance did not affect the accuracy. However, the efficiency increased after PCA was conducted in each pattern.
      Analysis of the correlation between the IQ parameters revealed a strong correlation between the mean pixel values inside and outside the ROI (r = 0.99), the noise inside and outside the ROI (r = 0.90), and the noise inside the ROI and the local contrast (r = 0.83). Labels were slightly correlated with CNR (r = 0.35) but were not correlated with other parameters.
      The performance levels of different feature combinations in the evaluation of influential features are shown in Fig. 5. In this figure, superior accuracy levels are indicated where the combinations included position features. The discriminative features were explored using an ANOVA test between 3 groups for each feature. Among 159 features, 11 in fiber, 35 in speck, and 20 in mass had significantly different sample means between 3 labels. Fig. 6(a) shows the distribution of these discriminative features in the form of a percentage (the number of features divided by the total number of features in each category). The discriminative features were used to train and test the models on each pattern. The resulting accuracies are shown as squares in Fig. 6(b). The circles demonstrate that if the position feature was selected with discriminative features, the accuracies were comparable with those when the full feature set was used. These results indicate that the optimal influential features for performance were identified.
      Fig. 5. Accuracies for the model of fiber, speck, and mass patterns with over 128 feature combinations. Feature combinations that included the position feature (solid marks) had superior performance to other combinations.
      Fig. 6. (a) The distribution of discriminative features is shown as a percentage for each feature category. (b) Accuracies for the recognition of fiber, speck, and mass patterns when discriminative features were selected with and without the position feature. The averaged accuracies of the model trained on discriminative features with the position feature are comparable to those of the model trained on all features.
      The effect of model sample size is represented in Fig. 7. The accuracies decreased and the COV increased when the sample size was reduced. A noticeable threshold was identified when 30 % of the total data was used; smaller sample sizes resulted in a significant decrease in accuracy and a marked increase in COV.
      Fig. 7. Trends of accuracies and COV in fiber, speck, and mass patterns in relation to the subset sample size. Sample sizes smaller than 30% of the total dataset (138 phantom images) resulted in a significant decrease in accuracy and a significant increase in COV for each type of pattern.
      Fig. 8 illustrates the performance of the model under 3 training and testing conditions. No difference was observed between a model trained and tested on images from a specific-vendor dataset and a model trained and tested on images from the mixed-vendor dataset. However, when the model was trained on a specific-vendor dataset but tested on a mixed-vendor dataset, performance across all patterns was significantly lowered.
      Fig. 8. Accuracies for the model of fiber, speck, and mass patterns when different training and testing set compositions were used. The model had superior accuracy when trained and tested consistently on a specific vendor (Hologic) or on a subset (mixed vendors). However, the accuracy decreased when the model was trained on a specific vendor and tested on a mixed-vendor subset.

      Discussion

      We proposed an automatic IQ evaluation framework based on an SVM. The framework enabled the evaluation of mammographic IQ using an ACR phantom. To mimic human evaluation, the supervised learning model was trained with labels assigned using manually assessed scores. In the testing set, the labels predicted by the model were highly accurate and in almost perfect agreement with the manual evaluation for all 3 patterns. This work demonstrates a novel method of implementing a reliable and nonsubjective automatic IQ evaluation framework, which can be used in prospective quality assurance procedures. Moreover, the principal component selection, feature selection, dataset size, and training set type of the framework were evaluated and discussed in detail.
      The averaged predictive scores and averaged manual scores were matched in all 3 patterns. The manual and algorithm scores deviated by 0.5 points for numerous images; however, this disparity was very common between two manual scorers. This level of deviation is thus acceptable and has limited influence on clinical outcomes. In Philips mammography, unlike in general FFDM, a photon-counting detector is used [Åslund M. Digital Mammography with a Photon Counting Detector in a Scanned Multislit Geometry]. The IQ was slightly lower due to the mismatch between scan speed and slit width [Yun S, Kim HK, Youn H, Joe O, Kim S, Park J, et al. Detective quantum efficiency of a silicon microstrip photon-counting detector having edge-on geometry under mammography imaging condition. Journal of Instrumentation. 2011;6:C12006-C. 10.1088/1748-0221/6/12/c12006; Young KC, Strudley CJ. Technical evaluation of Philips MicroDose SI digital mammography system. NHSBSP Equipment Report 1310. London: Public Health England, NHS Cancer Screening Programmes; 2016]. The modulation transfer function, noise power spectra, and detective quantum efficiency were comparable to FFDM; however, manual scores were lower because the visual perception mechanism was different from that in quantitative IQ evaluations. A machine learning model that captures quantum features to predict images might have superior performance to manual evaluation; however, the difference in visual perception resulted in the Philips system having mismatched scores. Konica mammography uses a computed radiography (CR) system. The IQ performance was lower due to the inferior integration of X-ray generation and image generation in CR units [Brandan M.-E., Ruiz-Trejo C., Cansino N., Cruz-Hernández J.-C., Moreno-Ramírez A., Rodriguez-López J.A., et al. Overall performance, image quality, and dose, in computed radiology (CR) mammography systems operating in the Mexican public sector]. Nevertheless, the SVM-based framework accurately predicted the IQ scores even though the mass scores in Konica systems were worse than in others.
      Model dimension and feature size affect model complexity [Raudys and Jain, 1991]. PCA is used to reduce model dimension and improve computational efficiency, and it has increased accuracy in studies with a substantial number of features [Kustner et al., 2018]. We did not observe a significant increase in accuracy, which may imply that our extracted features were more independent. Nevertheless, PCA substantially increased training efficiency with comparable accuracy.
      To identify influential features, we first analyzed the correlations between standard IQ parameters and labels; the object signal and background were consistent in terms of average intensity and noise. However, the weak correlation between most IQ parameters and labels indicates that predicting labels through quantitative IQ parameters may be difficult. We also analyzed feature combinations. The results revealed that most of the better-performing combinations included the position feature. By testing, for each feature, the hypothesis that its mean is equal across the 3 labels, we also found that the most discriminative features were located in the categories of the global gradient and local gradient for fiber, the global and edge for speck, and the global gradient and texture for mass. With these results, we can achieve performance comparable to that of models trained on the full feature set when using only the most discriminative features together with the position feature. We define these features as the influential features. We can infer that the influential features provide sufficient information and were highly relevant to the ordinal labels, even though the orders of the means of the influential features were different from those of the labels (data not shown). Therefore, selecting appropriate features prior to model training may reduce the computational burden [Bolón-Canedo and Remeseiro, 2019] and still enable performance equivalent to that of models trained on the full feature set.
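      The per-feature test of mean equality across the 3 labels can be sketched as a one-way ANOVA F statistic, with features ranked from most to least discriminative. The text does not state which test was used, so ANOVA is an assumption here; the feature names and values below are hypothetical, and the sketch ranks by F without computing p-values.

```python
from statistics import fmean


def anova_f(groups):
    """One-way ANOVA F statistic for a list of per-label value lists."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k or any(len(g) == 0 for g in groups):
        raise ValueError("need >= 2 non-empty groups and more samples than groups")
    grand = fmean([v for g in groups for v in g])
    means = [fmean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ss_within == 0.0:
        return float("inf") if ss_between > 0.0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(samples, labels):
    """Return feature names sorted from most to least discriminative."""
    scores = {}
    for name in samples[0]:
        by_label = {}
        for row, label in zip(samples, labels):
            by_label.setdefault(label, []).append(row[name])
        scores[name] = anova_f(list(by_label.values()))
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical feature vectors for six pattern images, two per label.
samples = [
    {"global_gradient": 0.1, "mean_intensity": 5.0},
    {"global_gradient": 0.2, "mean_intensity": 4.0},
    {"global_gradient": 1.0, "mean_intensity": 5.1},
    {"global_gradient": 1.1, "mean_intensity": 4.1},
    {"global_gradient": 2.0, "mean_intensity": 4.9},
    {"global_gradient": 2.1, "mean_intensity": 4.2},
]
labels = [0, 0, 1, 1, 2, 2]
ranking = rank_features(samples, labels)
```

      In this toy data the gradient feature separates the labels and the intensity feature does not, so the gradient feature ranks first. A subset of the top-ranked features could then be passed to the classifier in place of the full feature set.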
      The accuracy and generalizability of machine learning are strongly related to the sample size and diversity of the dataset [Willemink et al., 2020]. Two experiments were designed to determine the effect of sample size and dataset type. The required sample size in a machine learning model depends on the model complexity [Raudys and Jain, 1991]. The framework in the current study does not require as large a sample size as a deep learning model does. We first trained the model using 2766 fiber images and 2305 speck and mass images. The dataset sizes were similar to those used by Sousa et al. [Sousa et al., 2014], who employed 2542 disc images. To determine the smallest necessary sample size, we reduced the sample size linearly in steps of 10% of the total dataset, and we found that accuracy decreased at sample sizes below 30% of the total dataset (i.e., 828 images in the SVM-based framework). When choosing a training subset, we found that training and testing on images from a single vendor with a small dataset performed well. However, performance decreased when a model trained with images from a specific vendor was tested with another vendor’s images. This result confirmed that a model trained on a specific vendor exhibits an inherent bias, known as a vendor bias or single-source bias [Willemink et al., 2020]. Thus, to mitigate bias and ensure model generalizability, employing datasets from multiple institutional sources or a large training dataset is recommended. Unfortunately, medical images are usually expensive to acquire, and thus dataset size is limited. To train an automatic IQ evaluation framework with a limited dataset, a diverse dataset from multiple imaging vendors should be used.
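      The dataset composition experiment can be set up with a split helper like the one below, which builds the single-vendor and mixed-vendor training and testing subsets from one shared hold-out. This is a minimal sketch: the record schema (a dict with a "vendor" key), the 20% test fraction, and the function name are assumptions, not details taken from the study.

```python
import random


def vendor_splits(records, train_vendor, test_fraction=0.2, seed=0):
    """Build index sets for single-vendor and mixed-vendor train/test subsets.

    records: list of dicts with a "vendor" key (hypothetical schema).
    The single-vendor subsets are carved out of the mixed ones, so no
    training index ever appears in any test subset.
    """
    rng = random.Random(seed)
    indices = list(range(len(records)))
    rng.shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    mixed_test, mixed_train = indices[:n_test], indices[n_test:]
    single_train = [i for i in mixed_train if records[i]["vendor"] == train_vendor]
    single_test = [i for i in mixed_test if records[i]["vendor"] == train_vendor]
    return {
        "mixed_train": mixed_train,
        "mixed_test": mixed_test,
        "single_train": single_train,
        "single_test": single_test,
    }


# Toy records; the vendor names are the ones mentioned in the text.
records = [{"vendor": v} for v in ["Hologic", "Philips", "Konica", "Hologic", "Hologic"] * 4]
splits = vendor_splits(records, train_vendor="Hologic")
```

      Vendor bias would then show up as the gap between a model fitted on `single_train` and scored on `single_test` and the same model scored on `mixed_test`.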
      One limitation of our work was that we used only intuitive spatial image features, as employed in similar studies [Ramos et al., 2018; Nakanishi et al., 2018]. However, more obscure features related to the frequency domain could also be considered. Another limitation was that the sparsity of semivisible patterns, as identified by phantom scoring, resulted in an imbalance in sample size between labels. Smaller sample sizes reduce statistical power. This limitation could be rectified by matching sample sizes before training.
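      One simple way to match sample sizes before training is random undersampling, in which every label keeps only as many samples as the rarest label has. The sketch below is one such option under that assumption and is not the procedure used in this study; the label names and counts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample_to_balance(labels, seed=0):
    """Return sorted indices so that every label keeps the same number of samples."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    if not by_label:
        return []
    # The rarest label sets the per-label sample count.
    target = min(len(indices) for indices in by_label.values())
    kept = []
    for indices in by_label.values():
        kept.extend(rng.sample(indices, target))
    return sorted(kept)


# Hypothetical imbalanced labels with few semivisible patterns.
labels = ["visible"] * 10 + ["semivisible"] * 3 + ["invisible"] * 7
kept = undersample_to_balance(labels)
balanced_counts = Counter(labels[i] for i in kept)
```

      Undersampling discards data from the larger classes, so with a small dataset it trades balance against total training size.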
      In conclusion, we developed an SVM-based automatic IQ evaluation framework for a mammographic ACR phantom that accurately predicted human evaluations. The sample size experiments revealed that 138 phantom images (30% of the total dataset) is the minimum training sample size for this framework. In the dataset composition experiments, vendor bias was observed when the training subset and testing dataset came from different vendors.
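      The 138 phantom images quoted here and the 828 images quoted in the sample size experiment agree if each phantom image contributes 6 fiber patterns and 5 speck or mass patterns. The arithmetic below checks this; the per-phantom pattern counts are inferred from the reported totals (2766 and 2305) and are not stated in this section, and rounding down to whole phantom images is an assumption.

```python
FIBERS_PER_PHANTOM = 6   # assumption inferred from 2766 fiber images
SPECKS_PER_PHANTOM = 5   # assumption inferred from 2305 speck (and mass) images

TOTAL_FIBER_IMAGES = 2766
TOTAL_SPECK_IMAGES = 2305


def subset_sizes(n_phantoms, patterns_per_phantom, step_percent=10):
    """Map each percentage of the dataset to (phantom images, pattern images)."""
    sizes = {}
    for percent in range(step_percent, 101, step_percent):
        phantoms = n_phantoms * percent // 100  # round down to whole phantoms
        sizes[percent] = (phantoms, phantoms * patterns_per_phantom)
    return sizes


n_phantoms = TOTAL_FIBER_IMAGES // FIBERS_PER_PHANTOM  # 461 phantom images
fiber_sizes = subset_sizes(n_phantoms, FIBERS_PER_PHANTOM)
speck_sizes = subset_sizes(n_phantoms, SPECKS_PER_PHANTOM)
# fiber_sizes[30] is (138, 828): 138 phantom images, 828 fiber pattern images
```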

      Declaration of Competing Interest

      The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

      References

        • Tang J, Liu X. Classification of Breast Mass in Mammography with an Improved Level Set Segmentation by Combining Morphological Features and Texture Features. In: Multi Modality State-of-the-Art Medical Image Segmentation and Registration Methodologies. Volume II. New York, NY: Springer; 2011: 119-135.
        • Samei E. Medical physics 3.0: A renewed model for practicing medical physics in clinical imaging. Phys Med. 2022; 94: 53-57. https://doi.org/10.1016/j.ejmp.2021.12.020
        • Marshall NW. A comparison between objective and subjective image quality measurements for a full field digital mammography system. Phys Med Biol. 2006; 51: 2441-2463. https://doi.org/10.1088/0031-9155/51/10/006
        • Lee J, Nishikawa RM, Reiser I, Zuley ML, Boone JM. Lack of agreement between radiologists: implications for image-based model observers. J Med Imaging (Bellingham). 2017; 4: 025502. https://doi.org/10.1117/1.JMI.4.2.025502
        • Manco L, Maffei N, Strolin S, Vichi S, Bottazzi L, Strigari L. Basic of machine learning and deep learning in imaging for medical physicists. Phys Med. 2021; 83: 194-205. https://doi.org/10.1016/j.ejmp.2021.03.026
        • Ramos JE, Kim HY, Tancredi F. Automation of the ACR MRI Low-Contrast Resolution Test Using Machine Learning. In: 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). Beijing, China: IEEE; 2018: 1-6.
        • Gennaro G, Ballaminut A, Contento G. A multiparametric automatic method to monitor long-term reproducibility in digital mammography: results from a regional screening programme. Eur Radiol. 2017; 27: 3776-3787. https://doi.org/10.1007/s00330-017-4735-x
        • Sun J, Barnes M, Dowling J, Menk F, Stanwell P, Greer PB. An open source automatic quality assurance (OSAQA) tool for the ACR MRI phantom. Australas Phys Eng Sci Med. 2015; 38: 39-46. https://doi.org/10.1007/s13246-014-0311-8
        • Sousa M, Siqueira P, Medeiros R, Schiabel H. Automatic evaluation of quality parameters in digital mammography images using the Phantom CDMAM. In: XXIV Brazilian Congress of Biomedical Engineering (CBEB2014). Uberlândia, MG, Brazil. 2014: 181-184.
        • Tsai M-H, Chung C-T, Wang C-W, Chan Y-K, Shen C-C. An automatic contrast-detail phantom image quality figure evaluator in digital radiography. Int J Innov Comput. 2012; I: 1063-1075.
        • Ehman MO, Bao Z, Stiving SO, Kasam M, Lanners D, Peterson T, et al. Automated low-contrast pattern recognition algorithm for magnetic resonance image quality assessment. Med Phys. 2017; 44: 4009-4024. https://doi.org/10.1002/mp.12370
        • Davids M, Zollner FG, Ruttorf M, Nees F, Flor H, Schumann G, et al. Fully-automated quality assurance in multi-center studies using MRI phantom measurements. Magn Reson Imaging. 2014; 32: 771-780. https://doi.org/10.1016/j.mri.2014.01.017
        • Panych LP, Chiou JY, Qin L, Kimbrell VL, Bussolari L, Mulkern RV. On replacing the manual measurement of ACR phantom images performed by MRI technologists with an automated measurement approach. J Magn Reson Imaging. 2016; 43: 843-852. https://doi.org/10.1002/jmri.25052
        • Cohen MW, Kennedy JA, Pirmisashvili A, Orlikov G. An automatic system for analyzing phantom images to determine the reliability of PET/SPECT cameras. In: Proceedings of the ASME Design Engineering Technical Conference. 2015.
        • Alvarez M, Pina DR, Miranda JR, Duarte SB. Application of wavelets to the evaluation of phantom images for mammography quality control. Phys Med Biol. 2012; 57: 7177-7190. https://doi.org/10.1088/0031-9155/57/21/7177
        • Cortes C, Vapnik V. Support-vector networks. Machine learning. 1995; 20: 273-297.
        • Pizarro RA, Cheng X, Barnett A, Lemaitre H, Verchinski BA, Goldman AL, et al. Automated quality assessment of structural magnetic resonance brain images based on a supervised machine learning algorithm. Front Neuroinf. 2016; 10: 52.
        • Kopp FK, Catalano M, Pfeiffer D, Fingerle AA, Rummeny EJ, Noel PB. CNN as model observer in a liver lesion detection task for x-ray computed tomography: A phantom study. Med Phys. 2018; 45: 4439-4447. https://doi.org/10.1002/mp.13151
        • Willemink MJ, Koszek WA, Hardell C, Wu J, Fleischmann D, Harvey H, et al. Preparing Medical Imaging Data for Machine Learning. Radiology. 2020; 295: 4-15. https://doi.org/10.1148/radiol.2020192224
        • Castiglioni I, Rundo L, Codari M, Di Leo G, Salvatore C, Interlenghi M, et al. AI applications to medical images: From machine learning to deep learning. Phys Med. 2021; 83: 9-24. https://doi.org/10.1016/j.ejmp.2021.02.006
        • Raudys SJ, Jain AK. Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Trans Pattern Anal Mach Intell. 1991; 13: 252-264.
        • Bolón-Canedo V, Remeseiro B. Feature selection in image analysis: a survey. Artif Intell Rev. 2019; 53: 2905-2931. https://doi.org/10.1007/s10462-019-09750-3
        • Hendrick RE, Bassett LW, Botsco MA, Deihel D, Feig S, Gray J, et al. Mammography Quality Control Manual. Reston, VA: American College of Radiology; 1999.
        • Berns EA, Pfeiffer DE, Butler PF, Adent C, Baker JA, Bassett LW, et al. Digital Mammography Quality Control Manual. Reston, VA: American College of Radiology; 2018.
        • Åslund M. Digital Mammography with a Photon Counting Detector in a Scanned Multislit Geometry. Stockholm, Sweden: KTH Royal Institute of Technology; 2007.
        • Yun S, Kim HK, Youn H, Joe O, Kim S, Park J, et al. Detective quantum efficiency of a silicon microstrip photon-counting detector having edge-on geometry under mammography imaging condition. Journal of Instrumentation. 2011; 6: C12006. https://doi.org/10.1088/1748-0221/6/12/c12006
        • Young KC, Strudley CJ. Technical evaluation of Philips MicroDose SI digital mammography system. NHSBSP Equipment Report 1310. London: Public Health England, NHS Cancer Screening Programmes; 2016.
        • Brandan M-E, Ruiz-Trejo C, Cansino N, Cruz-Hernández J-C, Moreno-Ramírez A, Rodriguez-López JA, et al. Overall performance, image quality, and dose, in computed radiology (CR) mammography systems operating in the Mexican public sector. In: Bosmans H, Marshall N, Ongeval CV. 15th International Workshop on Breast Imaging (IWBI2020): SPIE Proceedings. 2020: 8.
        • Kustner T, Gatidis S, Liebgott A, Schwartz M, Mauch L, Martirosian P, et al. A machine-learning framework for automatic reference-free quality assessment in MRI. Magn Reson Imaging. 2018; 53: 134-147. https://doi.org/10.1016/j.mri.2018.07.003
        • Nakanishi R, Sankaran S, Grady L, Malpeso J, Yousfi R, Osawa K, et al. Automated estimation of image quality for coronary computed tomographic angiography using machine learning. Eur Radiol. 2018; 28: 4018-4026. https://doi.org/10.1007/s00330-018-5348-8