Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks. While prior works have addressed unsupervised image segmentation, they significantly lag behind supervised models. In this paper, we use a diffusion UNet encoder as a foundation vision encoder and introduce DiffCut, an unsupervised zero-shot segmentation method that solely harnesses the output features from the final self-attention block. Through extensive experimentation, we demonstrate that the utilization of these diffusion features in a graph based segmentation algorithm, significantly outperforms previous state-of-the-art methods on zero-shot segmentation. Specifically, we leverage a recursive Normalized Cut algorithm that softly regulates the granularity of detected objects and produces well-defined segmentation maps that precisely capture intricate image details. Our work highlights the remarkably accurate semantic knowledge embedded within diffusion UNet encoders that could then serve as foundation vision encoders for downstream tasks.
As good candidates for a task of unsupervised segmentation are expected to be semantically coherent, we conduct a comparison between different families of foundation models on their internal alignment at the patch-level.
Selected models include text-to-image DMs (SSD-1B), text-aligned contrastive models (CLIP, SigLIP) and self-supervised models (DINO, DINOv2).
We observe that SSD-1B UNet encoder exhibits a greater patch-level alignment than any other candidate model. Deriving from this observation, we assume that a graph-based partitioning algorithm would yield sharp image segments, each corresponding to a precise semantic concept as distinct objects would manifest as weakly connected components in a patch similarity matrix.
In this work, we approach image segmentation as a graph partitioning problem. To allow the segmentation of an arbitrary number of objects, we adopt a recursive normalized cut strategy. First, we extract image features using the diffusion UNet encoder, and then compute an affinity matrix (not shown below). We then apply a single step of the Normalized Cut algorithm to split the graph into two parts. The NCut value, which measures the partitioning cost, is compared to a hyperparameter, \( \tau \) that sets the maximum allowed cost for further partitioning. If this value is below \( \tau \), the partitioning continues recursively on one of the branches. When it exceeds \( \tau \), we register the most recent segment with an NCut value below \( \tau \) as a cluster, and then proceed to partition another branch of the graph.
@inproceedings{couairondiffcut,
title={DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut},
author={Couairon, Paul and Shukor, Mustafa and HAUGEARD, Jean-Emmanuel and Cord, Matthieu and THOME, Nicolas},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems}
}