Title
Ladder Fine-tuning approach for SAM integrating complementary network
01
Literature Digest Introduction
Medical image segmentation plays a vital role in healthcare. It aims to segment various body organs, including the liver, brain, and lesions, using various medical imaging modalities such as X-ray, CT scans, MRI scans, or ultrasound images. It therefore greatly assists clinicians in diagnosis, treatment planning, and post-treatment monitoring. Over the past decade, Convolutional Neural Networks (CNNs) have become a popular technique widely applied to various computer vision tasks. Long et al. proposed the Fully Convolutional Network (FCN), which can process input images of any size and produce segmentation results by replacing fully connected layers with convolutional layers. U-Net, developed by Ronneberger et al., is the most widely used architecture for medical image segmentation. It consists of an encoder and a decoder, with skip connections to preserve important features: the encoder path downsamples the input image while capturing high-level features, whereas the decoder path upsamples the feature maps to predict the segmentation result. Zhou et al. extended the U-Net architecture by introducing a nested skip-connection scheme, which captures multi-scale contextual information and better integrates features from different levels. Chen et al. proposed the DeepLab family of models, which introduced atrous/dilated convolution operations and fully connected Conditional Random Fields.

Recently, the Transformer [5], originally designed for natural language processing (NLP), has been introduced to computer vision (CV). Compared with traditional CNN architectures, Transformers can capture long-range dependencies. Dosovitskiy et al. proposed the Vision Transformer (ViT) for image classification, employing the self-attention mechanism. Subsequently, Chen et al. [7] proposed TransUNet, which uses ViT for segmentation tasks. TransUNet jointly exploits CNNs and ViT to obtain local and global contextual features from the input image. Tang et al. presented Swin UNETR, which uses a ViT model as the main encoder for feature extraction.
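As a concrete illustration of the encoder-downsample / decoder-upsample design with skip connections described above, here is a minimal toy sketch in PyTorch. It is not the original U-Net implementation; the layer widths, depth, and channel counts are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal encoder-decoder with one skip connection, in the spirit of U-Net."""
    def __init__(self, in_ch=1, out_ch=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)  # downsample while capturing higher-level features
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # upsample back to input size
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, out_ch, 1))  # per-pixel class logits

    def forward(self, x):
        e = self.enc(x)                        # encoder features, kept for the skip
        b = self.bottleneck(self.down(e))
        u = self.up(b)
        # Skip connection: concatenate encoder features to preserve fine detail.
        return self.dec(torch.cat([u, e], dim=1))

mask_logits = TinyUNet()(torch.randn(1, 1, 64, 64))  # -> shape (1, 2, 64, 64)
```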
Abstract
Recently, foundation models have been introduced, demonstrating strong performance on various tasks in the field of computer vision. These models, such as the Segment Anything Model (SAM), are generalized models trained on huge datasets. Ongoing research currently focuses on exploring how to effectively utilize these generalized models in specific domains, such as medical imaging. However, in medical imaging, the lack of training samples due to privacy concerns and other factors presents a major challenge for applying these generalized models to the medical image segmentation task. To address this issue, effective fine-tuning of these models is crucial to ensure their optimal utilization. In this study, we propose to combine a complementary Convolutional Neural Network (CNN) with the standard SAM network for medical image segmentation. To reduce the burden of fine-tuning the large foundation model and implement a cost-efficient training scheme, we fine-tune only the additional CNN network and the SAM decoder part. This strategy significantly reduces training time and achieves competitive results on a publicly available dataset. The code is available at https://github.com/11yxk/SAM-LST.
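The cost-efficient scheme summarized above amounts to freezing SAM's large pre-trained image encoder and optimizing only the added CNN branch and SAM's mask decoder. Below is a minimal sketch using the official segment-anything package; the `cnn_encoder` module is a hypothetical stand-in, not the paper's actual CNN branch, and the learning rate is an arbitrary choice.

```python
import torch
import torch.nn as nn
from segment_anything import sam_model_registry

# Load the pre-trained SAM backbone (ViT-B variant; checkpoint downloaded beforehand).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# Freeze the heavy pre-trained image encoder so it receives no gradient updates.
for p in sam.image_encoder.parameters():
    p.requires_grad = False

# Hypothetical placeholder for the complementary CNN branch
# (the paper's actual CNN encoder architecture is shown in Fig. 2).
cnn_encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())

# Only the lightweight parts are optimized: the CNN branch and SAM's mask decoder.
trainable = list(cnn_encoder.parameters()) + list(sam.mask_decoder.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```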
METHOD
A. Segment Anything Model
The Segment Anything Model (SAM) [13] is the first attempt at a foundation model for the segmentation task. SAM consists of three components: an image encoder, a prompt encoder, and a mask decoder. The image encoder employs an MAE pre-trained ViT network [6] to extract image features. The prompt encoder supports four types of prompt inputs: points, boxes, text, and masks. Points and boxes are embedded with positional encodings, while text is embedded with the text encoder from CLIP; masks are embedded using convolution operations. The mask decoder is designed to map the image embedding and prompt embedding in a lightweight manner. These two types of embeddings interact through a cross-attention module, using one embedding as the query and the other as the key and value vectors. Finally, transposed convolutions are used to upsample the features. The mask decoder can generate multiple results, since the provided prompts may be ambiguous; the default number of outputs is set to three. It is worth mentioning that the image encoder extracts image features only once for each input image. After that, the lightweight prompt encoder and mask decoder can interact with users in a web browser in real time, based on different input prompts. SAM is trained on more than 11M images and 1B masks. Experimental results demonstrate its superior zero-shot transfer ability. As implied by its name, the model can segment almost anything, even cases it has not seen before (unseen test samples).
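The prompt-then-predict workflow described above can be exercised through the official segment-anything API. A minimal sketch follows; the checkpoint path, dummy image, and point coordinates are illustrative placeholders.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for a real HxWx3 RGB image
predictor.set_image(image)  # the image encoder runs only once per image here

# A single foreground point prompt; label 1 = foreground, 0 = background.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[256, 256]]),
    point_labels=np.array([1]),
    multimask_output=True,  # returns three candidate masks for ambiguous prompts
)
```

Because `set_image` caches the image embedding, repeated calls to `predict` with different prompts are lightweight, which is what enables the real-time interactive use the text describes.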
CONCLUSION
We introduce a robust and flexible fine-tuning strategy for large foundation models, specifically SAM. Our proposed approach of integrating a CNN encoder while employing a learnable weight parameter achieves significant results. This approach paves the way for new fine-tuning strategies in computer vision. Furthermore, it minimizes resource utilization and reduces training time. In the future, we aim to explore additional fine-tuning methods to further enhance performance.
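The learnable weight parameter mentioned above can be realized as a single scalar that blends SAM's frozen features with the complementary CNN features. The sketch below is one plausible formulation under that assumption, not necessarily the paper's exact fusion rule; the initial value and feature shapes are illustrative.

```python
import torch
import torch.nn as nn

class LadderFusion(nn.Module):
    """Blend SAM image-encoder features with complementary CNN features."""
    def __init__(self):
        super().__init__()
        # Single learnable scalar; 0.5 is an arbitrary initial value.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, sam_feat, cnn_feat):
        # Both feature maps are assumed to share the same shape.
        return sam_feat + self.alpha * cnn_feat

fusion = LadderFusion()
fused = fusion(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
```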
Fig
Fig. 1. Overview of our proposed method.
Fig. 2. The architecture of the CNN encoder. In this figure, we omit the activation functions, batch normalization layers, and residual connections for simplicity.
Fig. 3. Segmentation results on the Synapse dataset.
Table
TABLE I. Comparison with state-of-the-art methods.
TABLE II. Ablation results.