TLDR: In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning and additional training ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results