Deep Neural Networks to Detect and Quantify Lymphoma Lesions: Discussion

12 Jun 2024


(1) Shadab Ahamed, University of British Columbia, Vancouver, BC, Canada, and BC Cancer Research Institute, Vancouver, BC, Canada. He was also a Mitacs Accelerate Fellow (May 2022 - April 2023) with Microsoft AI for Good Lab, Redmond, WA, USA;

(2) Yixi Xu, Microsoft AI for Good Lab, Redmond, WA, USA;

(3) Claire Gowdy, BC Children’s Hospital, Vancouver, BC, Canada;

(4) Joo H. O, St. Mary’s Hospital, Seoul, Republic of Korea;

(5) Ingrid Bloise, BC Cancer, Vancouver, BC, Canada;

(6) Don Wilson, BC Cancer, Vancouver, BC, Canada;

(7) Patrick Martineau, BC Cancer, Vancouver, BC, Canada;

(8) François Benard, BC Cancer, Vancouver, BC, Canada;

(9) Fereshteh Yousefirizi, BC Cancer Research Institute, Vancouver, BC, Canada;

(10) Rahul Dodhia, Microsoft AI for Good Lab, Redmond, WA, USA;

(11) Juan M. Lavista, Microsoft AI for Good Lab, Redmond, WA, USA;

(12) William B. Weeks, Microsoft AI for Good Lab, Redmond, WA, USA;

(13) Carlos F. Uribe, BC Cancer Research Institute, Vancouver, BC, Canada, and University of British Columbia, Vancouver, BC, Canada;

(14) Arman Rahmim, BC Cancer Research Institute, Vancouver, BC, Canada, and University of British Columbia, Vancouver, BC, Canada.


In this work, we trained and evaluated four distinct neural network architectures to automate the segmentation of lymphoma lesions from PET/CT datasets sourced from three different cohorts. To assess model performance, we conducted comprehensive evaluations on an internal test set originating from these three cohorts and showed that SegResNet and UNet outperformed DynUNet and SwinUNETR on the DSC (mean and median) and median FPV metrics, while SwinUNETR had the best median FNV. In addition to internal evaluations, we extended our analysis to an external out-of-distribution test on a sizable public lymphoma PET/CT dataset. On this external test set as well, SegResNet emerged as the top performer in terms of the DSC and FPV metrics, underscoring its robustness and effectiveness, while UNet displayed the best performance on FNV.
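For readers less familiar with the DSC, the sketch below computes it for a toy prediction using the standard definition, twice the overlap divided by the sum of the two mask sizes. The set-of-voxel-coordinates representation, the function name, and the convention for two empty masks are assumptions made for illustration, not the evaluation code used in this work.

```python
def dice_similarity(pred, truth):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2 * len(pred & truth) / (len(pred) + len(truth))


# Toy example: a 4-voxel ground truth lesion and a 3-voxel prediction
# that share exactly 2 voxels.
truth = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
pred = {(0, 0, 1), (0, 1, 1), (0, 2, 1)}

print(f"DSC = {dice_similarity(pred, truth):.3f}")
```

Because the DSC is a voxel-level overlap score, it says nothing about which lesions were found or missed, which is why volume-based and lesion-level measures are reported alongside it.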

It is important to highlight that SegResNet and UNet were trained on larger patches, specifically (224, 224, 224) and (192, 192, 192) respectively, while DynUNet and SwinUNETR were trained on relatively smaller patches, namely (160, 160, 160) and (128, 128, 128) respectively. Larger patch sizes during training allow a neural network to capture more of the surrounding context in the data, thereby enhancing its performance in segmentation tasks [17]. This observation aligns with our results, where the superior performance of SegResNet and UNet can be attributed to their exposure to larger patches during training. Moreover, larger batch sizes enable robust training by estimating the gradients more accurately [17], but with our chosen training patch sizes, we could not train SegResNet, DynUNet and SwinUNETR with nb > 1 due to memory limitations (although we could accommodate nb = 8 for UNet). Hence, for a fair comparison, all networks were trained with nb = 1. Our inability to train DynUNet and SwinUNETR on larger patch and mini-batch sizes was primarily due to computational resource limitations. This limitation presents an avenue for future research, where training these models with larger patches and batch sizes could potentially yield further improvements in segmentation accuracy.
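To put the patch sizes above in perspective, the short sketch below counts the voxels in each training patch. It is plain arithmetic on the reported shapes. The helper name is hypothetical, and the idea that memory grows roughly with voxel count is only a rule of thumb, since the real footprint also depends on each architecture's depth and channel widths.

```python
# Training patch sizes as reported in the text.
patch_sizes = {
    "SegResNet": (224, 224, 224),
    "UNet": (192, 192, 192),
    "DynUNet": (160, 160, 160),
    "SwinUNETR": (128, 128, 128),
}


def voxels_per_patch(shape):
    """Number of voxels in a 3D patch of the given (depth, height, width)."""
    depth, height, width = shape
    return depth * height * width


smallest = min(voxels_per_patch(shape) for shape in patch_sizes.values())

for name, shape in patch_sizes.items():
    count = voxels_per_patch(shape)
    print(f"{name}: {count:,} voxels, {count / smallest:.2f}x the smallest patch")
```

The SegResNet patch holds more than five times as many voxels as the SwinUNETR patch, which illustrates how much more context a single training sample carries and why input size alone can exhaust GPU memory at nb = 1.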

We assessed the reproducibility of lesion measures and found that on the internal test set, TMTV and TLG were reproducible across all networks, while Dmax was not reproducible by any network. SUVmean was reproducible by all networks except UNet, SUVmax by only SegResNet, and number of lesions by only UNet and SegResNet. On the external test set, reproducibility was more limited, with only SUVmean being reproducible by both SegResNet and SwinUNETR, number of lesions by SegResNet, and TLG by DynUNet (Figs. 3 and 4). Furthermore, we quantified the networks’ error in estimating the value of lesion measures using MAPE and found that MAPE generally decreases as a function of lesion measure values (for all lesion measures) on the combined internal and external test set (Fig. 5). The networks generally made large prediction errors when the ground truth lesion measures were very small. We also showed that, in general, on a set of images with larger patient-level lesion SUVmean, SUVmax, TMTV, and TLG, a network is able to achieve a higher median DSC, although for very high values of these lesion measures, the performance generally plateaus. On the other hand, the DSC performance is not much affected by the number of lesions, while for a set of images with higher Dmax, the performance generally decreases for all networks (Fig. 7).
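The dependence of MAPE on the size of the ground truth value can be seen in a small worked example. The sketch below uses the usual definition of mean absolute percentage error. The TMTV values are made up for illustration, and skipping cases with a zero ground truth value is an assumption of this sketch rather than a documented choice of this work.

```python
def mape(predicted, ground_truth):
    """Mean absolute percentage error (in percent) over paired values."""
    if len(predicted) != len(ground_truth):
        raise ValueError("inputs must have the same length")
    # Assumption: cases with zero ground truth are skipped, because the
    # percentage error is undefined for them.
    pairs = [(p, g) for p, g in zip(predicted, ground_truth) if g != 0]
    if not pairs:
        raise ValueError("no nonzero ground truth values")
    return 100 * sum(abs(p - g) / abs(g) for p, g in pairs) / len(pairs)


# Hypothetical TMTV values in ml. Every prediction is off by the same 2 ml.
small_truth, small_pred = [4.0, 5.0], [6.0, 7.0]
large_truth, large_pred = [400.0, 500.0], [402.0, 502.0]

print(f"MAPE, small lesions: {mape(small_pred, small_truth):.2f}%")
print(f"MAPE, large lesions: {mape(large_pred, large_truth):.2f}%")
```

The same absolute error yields a percentage error one hundred times larger for the small volumes, which is consistent with the observation that errors are largest when the ground truth lesion measures are very small.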

Because much PET/CT data is privately owned by healthcare institutions, researchers face significant challenges in accessing diverse datasets for training and testing deep learning models. In such a scenario, to improve the interpretability of models, it is crucial for researchers to investigate how the performance of their models depends on dataset characteristics. By studying how model performance correlates with the image/lesion characteristics, researchers can gain insights into the strengths and limitations of their models [13].

Alongside the evaluation of segmentation performance, we also introduced three distinct detection criteria, denoted as Criterion 1, 2, and 3. These criteria served a specific purpose: to evaluate the networks’ performance on a per-lesion basis. This stands in contrast to the segmentation performance assessment, which primarily focuses on the voxel-level accuracy of the networks. The rationale behind introducing these detection criteria lies in the need to assess how well the networks identify and detect lesions within the images, as opposed to merely evaluating their ability to delineate lesion boundaries at the voxel level. The ability to detect the presence of lesions (Criterion 1) is crucial, as it directly influences whether a potential health concern is identified or missed. Detecting even a single voxel of a lesion could trigger further investigation or treatment planning. Lesion count and accurate localization (Criterion 2) are important for treatment planning and monitoring disease progression. Knowing not only that a lesion exists but also how many there are and where they are located can significantly impact therapeutic decisions. Criterion 3, which focused on segmenting lesions based on lesion metabolic characteristics (SUVmax), adds a further layer of clinical relevance.
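A minimal sketch of how three such lesion-level criteria could be coded is given below. The exact matching rules are not restated in this section, so the rules here are assumptions: Criterion 1 is taken as any one-voxel overlap, Criterion 2 as an intersection-over-union above a threshold (the 0.5 default is hypothetical), and Criterion 3 as the prediction covering the ground truth lesion's SUVmax voxel. Lesions are represented as sets of voxel coordinates, and all function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two voxel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def is_detected(pred_lesion, truth_lesion, suv, criterion, iou_threshold=0.5):
    """Decide whether one predicted lesion detects one ground truth lesion.

    criterion 1: at least one shared voxel (assumed rule)
    criterion 2: IoU above iou_threshold (assumed rule and threshold)
    criterion 3: prediction covers the truth lesion's SUVmax voxel (assumed rule)
    """
    if criterion == 1:
        return len(pred_lesion & truth_lesion) > 0
    if criterion == 2:
        return iou(pred_lesion, truth_lesion) > iou_threshold
    if criterion == 3:
        hottest_voxel = max(truth_lesion, key=lambda voxel: suv[voxel])
        return hottest_voxel in pred_lesion
    raise ValueError("criterion must be 1, 2, or 3")


def sensitivity(pred_lesions, truth_lesions, suv, criterion):
    """Fraction of ground truth lesions detected by any predicted lesion."""
    if not truth_lesions:
        return float("nan")
    hits = sum(
        any(is_detected(p, t, suv, criterion) for p in pred_lesions)
        for t in truth_lesions
    )
    return hits / len(truth_lesions)


# Toy case: two ground truth lesions and one predicted lesion.
lesion_a = {(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 0, 3)}
lesion_b = {(5, 5, 5)}
suv = {(0, 0, 0): 2.0, (0, 0, 1): 9.0, (0, 0, 2): 3.0, (0, 0, 3): 1.0,
       (5, 5, 5): 4.0}
predicted = [{(0, 0, 2), (0, 0, 3), (0, 0, 4)}]

for criterion in (1, 2, 3):
    value = sensitivity(predicted, [lesion_a, lesion_b], suv, criterion)
    print(f"Criterion {criterion}: sensitivity = {value:.2f}")
```

In this toy case the single prediction touches the first lesion but misses its hottest voxel and overlaps it only partially, so it counts as a detection under the loosest rule and as a miss under the other two.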

Using these detection metrics, we assessed the sensitivities and FP detections for all networks and showed that, depending on the detection criterion, a network can have very high sensitivity even when its DSC performance is low. Given these different detection criteria, a trained model can be chosen based on specific clinical use cases. For example, some use cases might require detecting all lesions without being overly concerned about segmenting the exact lesion boundary, while other use cases might call for more robust boundary delineations.

Furthermore, we assessed the intra-observer variability of a physician in segmenting both “easy” and “hard” cases, noting challenges in consistent segmentation of cases from the “hard” subset. In lymphoma lesion segmentation, cases can vary in difficulty due to factors like size, shape, and location of lesions, or image quality. By identifying which cases are consistently difficult for even an experienced physician to segment, we gained insights into the complexities and nuances of the segmentation task. Finally, we also assessed the inter-observer agreement between three physicians. Although we inferred that there was a substantial level of agreement between the three physicians, the assessment was performed on only 9 cases, resulting in low statistical power.

To improve the consistency of ground truth in medical image segmentation, a well-defined protocol is essential. This protocol should engage multiple expert physicians independently in delineating regions of interest (ROIs) or lesions within PET/CT images. Instead of a single physician segmenting a cohort alone, multiple annotators should segment the same images without knowledge of each other’s work. Discrepancies or disagreements among physicians can be resolved through structured approaches such as facilitated discussions, clinical information reviews, or image clarification. Such a ground truth process improves inter-observer agreement and strengthens the validity of research findings and clinical applications relying on these annotations.

This paper is available on arXiv under a CC 4.0 license.