DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut

Our DiffCut method exploits features from a diffusion UNet encoder in a graph-based recursive partitioning algorithm. Compared to DiffSeg, DiffCut provides finely detailed segmentation maps that more closely align with semantic concepts.

Abstract

Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks. While prior works have addressed unsupervised image segmentation, they significantly lag behind supervised models. In this paper, we use a diffusion UNet encoder as a foundation vision encoder and introduce DiffCut, an unsupervised zero-shot segmentation method that solely harnesses the output features from the final self-attention block. Through extensive experimentation, we demonstrate that using these diffusion features in a graph-based segmentation algorithm significantly outperforms previous state-of-the-art methods on zero-shot segmentation. Specifically, we leverage a recursive Normalized Cut algorithm that softly regulates the granularity of detected objects and produces well-defined segmentation maps that precisely capture intricate image details. Our work highlights the remarkably accurate semantic knowledge embedded within diffusion UNet encoders, which could then serve as foundation vision encoders for downstream tasks.

Semantic Coherence in Vision Encoders

Because good candidates for unsupervised segmentation are expected to be semantically coherent, we compare several families of foundation models on their internal alignment at the patch level. The selected models include text-to-image diffusion models (SSD-1B), text-aligned contrastive models (CLIP, SigLIP) and self-supervised models (DINO, DINOv2).

Qualitative results on the semantic coherence of various vision encoders. We select a patch (red marker) associated with the dog in Ref. Image. Top row shows the cosine similarity heatmap between the selected patch and all patches produced by vision encoders for Ref. Image. Bottom row shows the heatmap between the selected patch in Ref. Image and all patches produced by vision encoders for Target Image.

We observe that the SSD-1B UNet encoder exhibits greater patch-level alignment than any other candidate model. From this observation, we hypothesize that a graph-based partitioning algorithm would yield sharp image segments, each corresponding to a precise semantic concept, since distinct objects would manifest as weakly connected components in a patch similarity matrix.
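The patch-level comparison described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `cosine_heatmap`, the feature shapes, and the random arrays standing in for encoder outputs are all assumptions made for the example.

```python
import numpy as np

def cosine_heatmap(query_feats, query_index, target_feats, grid_hw):
    """Cosine similarity between one query patch and every target patch.

    query_feats:  (N, C) patch features of the reference image
    query_index:  index of the selected patch (the "red marker")
    target_feats: (M, C) patch features of the image to compare against
    grid_hw:      (H, W) patch grid of the target image, with H * W == M
    """
    eps = 1e-8  # guards against division by zero for all-zero features
    q = query_feats[query_index]
    q = q / (np.linalg.norm(q) + eps)
    t = target_feats / (np.linalg.norm(target_feats, axis=1, keepdims=True) + eps)
    return (t @ q).reshape(grid_hw)

# Random features stand in for real encoder outputs (hypothetical 32x32 grid, 64 channels).
rng = np.random.default_rng(0)
ref_feats = rng.normal(size=(32 * 32, 64))
tgt_feats = rng.normal(size=(32 * 32, 64))

self_map = cosine_heatmap(ref_feats, 100, ref_feats, (32, 32))   # top-row style heatmap
cross_map = cosine_heatmap(ref_feats, 100, tgt_feats, (32, 32))  # bottom-row style heatmap
```

With real encoder features, `self_map` would correspond to the top row of the figure and `cross_map` to the bottom row.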

Recursive Normalized Cut

In this work, we approach image segmentation as a graph partitioning problem. To allow the segmentation of an arbitrary number of objects, we adopt a recursive normalized cut strategy. First, we extract image features using the diffusion UNet encoder, and then compute an affinity matrix (not shown below). We then apply a single step of the Normalized Cut algorithm to split the graph into two parts. The NCut value, which measures the partitioning cost, is compared to a hyperparameter \( \tau \) that sets the maximum allowed cost for further partitioning. If this value is below \( \tau \), the partitioning continues recursively on one of the branches. When it exceeds \( \tau \), we register the most recent segment with an NCut value below \( \tau \) as a cluster, and then proceed to partition another branch of the graph.
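The recursion described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the released implementation: the affinity matrix `W` is taken as given, each bipartition thresholds the second generalized eigenvector at zero, and the helper names `ncut_bipartition` and `recursive_ncut` as well as the toy three-block affinity matrix are invented for the example.

```python
import numpy as np

def ncut_bipartition(W):
    """One Normalized Cut step on a symmetric, non-negative affinity matrix W.

    Returns a boolean mask for one side of the split and the NCut value.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d + 1e-12)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    lap = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(lap)  # eigenvalues come back in ascending order
    # Map the second eigenvector back to the generalized problem (D - W) y = lambda * D y
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    mask = fiedler > 0  # assumption: split at zero
    if mask.all() or not mask.any():
        return mask, np.inf  # degenerate split, treat as too costly
    cut = W[mask][:, ~mask].sum()
    return mask, cut / d[mask].sum() + cut / d[~mask].sum()

def recursive_ncut(W, tau):
    """Depth-first recursive NCut; a segment is kept whole once its split costs >= tau."""
    labels = np.zeros(len(W), dtype=int)
    next_label = 0
    stack = [np.arange(len(W))]
    while stack:
        idx = stack.pop()
        if len(idx) >= 2:
            mask, cost = ncut_bipartition(W[np.ix_(idx, idx)])
            if cost < tau:
                # Cheap cut: keep partitioning both branches.
                stack.append(idx[mask])
                stack.append(idx[~mask])
                continue
        # Split too costly (or segment too small): register it as a cluster.
        labels[idx] = next_label
        next_label += 1
    return labels

# Toy affinity (hypothetical): three groups of 5, 7 and 4 patches.
sizes = [5, 7, 4]
group = np.repeat(np.arange(3), sizes)
inter = np.array([[1.0, 0.1, 0.001],
                  [0.1, 1.0, 0.001],
                  [0.001, 0.001, 1.0]])
W_demo = inter[group][:, group]
demo_labels = recursive_ncut(W_demo, tau=0.5)
```

A larger `tau` allows more costly cuts and therefore produces more, finer segments; a smaller `tau` stops the recursion earlier.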

DiffCut Pipeline

Overview of DiffCut. 1) DiffCut takes an image as input and extracts the features of the last self-attention block of a diffusion UNet encoder. 2) These features are used to construct an affinity matrix that serves in a recursive normalized cut algorithm, which outputs a segmentation map at the latent spatial resolution. 3) A high-resolution segmentation map is produced via a concept assignment mechanism on the features upsampled to the original image size.
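Step 3 is not detailed on this page, so the following is only one plausible reading of it, offered as a sketch: each cluster from step 2 is summarized by a prototype (the mean of its normalized features), the feature map is bilinearly upsampled, and every pixel is assigned to the most similar prototype. The prototype construction, the interpolation scheme, and the names `upsample_bilinear` and `assign_concepts` are assumptions, not the authors' stated method.

```python
import numpy as np

def upsample_bilinear(x, out_h, out_w):
    """Bilinear upsampling of an (h, w, c) feature map to (out_h, out_w, c)."""
    h, w, _ = x.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    rows = x[y0] * (1 - wy) + x[y1] * wy              # interpolate along height
    return rows[:, x0] * (1 - wx) + rows[:, x1] * wx  # then along width

def assign_concepts(feats, labels, out_hw):
    """Assign each upsampled pixel to the nearest cluster prototype.

    feats:  (h, w, c) latent-resolution features
    labels: (h, w) integer cluster ids at latent resolution
    out_hw: (H, W) target resolution
    """
    eps = 1e-8
    unit = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + eps)
    ids = np.unique(labels)
    # One prototype per cluster: mean of its normalized features (assumed design).
    protos = np.stack([unit[labels == k].mean(axis=0) for k in ids])
    protos = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + eps)
    up = upsample_bilinear(unit, *out_hw)
    sims = up @ protos.T  # (H, W, K) similarity of every pixel to every prototype
    return ids[sims.argmax(axis=-1)]

# Toy example (hypothetical): a 4x4 latent grid whose left and right halves differ.
feats_demo = np.zeros((4, 4, 2))
feats_demo[:, :2, 0] = 1.0
feats_demo[:, 2:, 1] = 1.0
labels_demo = np.zeros((4, 4), dtype=int)
labels_demo[:, 2:] = 1
hi_res = assign_concepts(feats_demo, labels_demo, (16, 16))
```

Because the assignment happens after upsampling, segment boundaries can fall between latent cells instead of being locked to the coarse grid.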

Zero-Shot Segmentation Results

BibTeX


@inproceedings{couairon2024diffcut,
  title={DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut},
  author={Paul Couairon and Mustafa Shukor and Jean-Emmanuel HAUGEARD and Matthieu Cord and Nicolas THOME},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=N0xNf9Qqmc}
}