Zero-Shot Image Segmentation via Recursive Normalized Cut on Diffusion Features

¹Sorbonne Université, ²Thales SIX GTS France

Our DiffCut method exploits features from a diffusion UNet encoder in a graph-based recursive partitioning algorithm. Compared to DiffSeg, DiffCut provides finely detailed segmentation maps that more closely align with semantic concepts.


Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks. While prior works have addressed unsupervised image segmentation, they significantly lag behind supervised models. In this paper, we use a diffusion UNet encoder as a foundation vision encoder and introduce DiffCut, an unsupervised zero-shot segmentation method that relies solely on the output features from the final self-attention block. Through extensive experimentation, we demonstrate that using these diffusion features in a graph-based segmentation algorithm significantly outperforms previous state-of-the-art methods on zero-shot segmentation. Specifically, we leverage a recursive Normalized Cut algorithm that softly regulates the granularity of detected objects and produces well-defined segmentation maps that precisely capture intricate image details. Our work highlights the remarkably accurate semantic knowledge embedded within diffusion UNet encoders, which could in turn serve as foundation vision encoders for downstream tasks.

Semantic Coherence in Vision Encoders

Because good candidates for unsupervised segmentation are expected to be semantically coherent, we compare different families of foundation models on their internal alignment at the patch level. Selected models include a text-to-image diffusion model (SSD-1B), text-aligned contrastive models (CLIP, SigLIP), and self-supervised models (DINO, DINOv2).

key observation

Qualitative results on the semantic coherence of various vision encoders. We select a patch (red marker) associated with the dog in the Ref. Image. The top row shows the cosine similarity heatmap between the selected patch and all patches produced by each vision encoder for the Ref. Image. The bottom row shows the heatmap between the selected patch in the Ref. Image and all patches produced by each vision encoder for the Target Image.

We observe that the SSD-1B UNet encoder exhibits greater patch-level alignment than any other candidate model. Based on this observation, we hypothesize that a graph-based partitioning algorithm would yield sharp image segments, each corresponding to a precise semantic concept, since distinct objects would manifest as weakly connected components in a patch similarity matrix.
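As a rough illustration of how such heatmaps can be computed, the sketch below takes one patch feature from a reference image and scores it against every patch of the reference and of a target image. The function name `cosine_heatmap`, the 16x16 grid, and the random arrays standing in for encoder features are all assumptions made for the example; they are not taken from the paper's code.

```python
import numpy as np

def cosine_heatmap(query, feats, grid_hw):
    """Cosine similarity between one patch feature and every patch of an image.

    query:   (D,) feature of the selected patch
    feats:   (H*W, D) patch features of the image to score, in row-major order
    grid_hw: (H, W) patch grid shape
    """
    h, w = grid_hw
    q = query / (np.linalg.norm(query) + 1e-8)
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return (f @ q).reshape(h, w)

# Random placeholders standing in for real encoder outputs (assumption).
rng = np.random.default_rng(0)
grid = (16, 16)
ref_feats = rng.normal(size=(grid[0] * grid[1], 64))
tgt_feats = rng.normal(size=(grid[0] * grid[1], 64))

row, col = 5, 7                       # the "red marker" patch in the reference
query = ref_feats[row * grid[1] + col]

ref_map = cosine_heatmap(query, ref_feats, grid)  # top-row style heatmap
tgt_map = cosine_heatmap(query, tgt_feats, grid)  # bottom-row style heatmap
```

With real features, an encoder with strong patch-level alignment would give high values over the whole object in both maps and low values elsewhere.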


diffcut pipeline

Overview of DiffCut. 1) DiffCut takes an image as input and extracts the features of the last self-attention block of a diffusion UNet encoder. 2) These features are used to construct an affinity matrix for a recursive normalized cut algorithm, which outputs a segmentation map at the latent spatial resolution. 3) A high-resolution segmentation map is produced via a concept assignment mechanism applied to the features upsampled to the original image size.
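Steps 2 and 3 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the affinity is taken to be the clipped cosine similarity raised to a power `alpha`, recursion stops when the normalized cut cost of a proposed split exceeds a threshold `tau`, and concept assignment labels each upsampled feature with the segment whose mean feature is most similar. The names `affinity`, `ncut_bipartition`, `recursive_ncut`, `assign_concepts`, and the default values of `alpha`, `tau`, and `min_patches` are hypothetical, and the toy 2-D features replace real encoder outputs.

```python
import numpy as np

def affinity(feats, alpha=10.0):
    """Patch affinity: clipped cosine similarity raised to alpha (assumed form)."""
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return np.clip(f @ f.T, 0.0, 1.0) ** alpha

def ncut_bipartition(W):
    """Split a graph in two using the second eigenvector of the normalized Laplacian."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d + 1e-8)
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)          # eigenvalues in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]
    mask = fiedler > 0
    cut = W[mask][:, ~mask].sum()
    assoc_a, assoc_b = d[mask].sum(), d[~mask].sum()
    if assoc_a == 0 or assoc_b == 0:         # degenerate split: one side empty
        return mask, np.inf
    return mask, cut / assoc_a + cut / assoc_b

def recursive_ncut(W, tau=0.5, min_patches=4):
    """Keep splitting while the normalized cut cost stays below tau."""
    labels = np.zeros(len(W), dtype=int)
    stack, next_label = [np.arange(len(W))], 0
    while stack:
        idx = stack.pop()
        if len(idx) >= min_patches:
            mask, cost = ncut_bipartition(W[np.ix_(idx, idx)])
            if cost < tau:
                stack.extend([idx[mask], idx[~mask]])
                continue
        labels[idx] = next_label             # no acceptable split: final segment
        next_label += 1
    return labels

def assign_concepts(up_feats, lowres_feats, labels):
    """Label each upsampled feature with the most similar segment prototype."""
    ids = np.unique(labels)
    protos = np.stack([lowres_feats[labels == k].mean(axis=0) for k in ids])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True) + 1e-8
    up = up_feats / (np.linalg.norm(up_feats, axis=1, keepdims=True) + 1e-8)
    return ids[(up @ protos.T).argmax(axis=1)]

# Toy features: two groups of 8 patches pointing in directions 60 degrees apart.
offsets = np.linspace(-0.05, 0.05, 8)
angles = np.concatenate([offsets, np.pi / 3 + offsets])
feats = np.stack([np.cos(angles), np.sin(angles)], axis=1)

labels = recursive_ncut(affinity(feats))
hires_labels = assign_concepts(feats, feats, labels)  # identity "upsampling" here
```

In this formulation a lower `tau` accepts fewer splits and yields coarser segments, while a higher `tau` yields finer ones, which is one way a single threshold can regulate granularity. In practice `up_feats` would come from interpolating the latent feature map to the image resolution.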

Zero-Shot Segmentation Results

Here we provide qualitative zero-shot segmentation results. For more analysis, please refer to the paper.

unsupervised results


