In this article, we will explain what multimodal embeddings are and how they can be used for image retrieval.
The goal of image retrieval is to find, in a large collection, the images most relevant to a query image. Imagine we are looking for more pictures depicting a specific geographical location, such as the Gullfoss waterfall in Iceland.
In a standard image retrieval system, the query is encoded into a high-dimensional vector by an image encoder such as ResNet or ViT, and the most relevant images are retrieved from a large database of candidate images using a similarity metric (e.g., cosine similarity).
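To make this concrete, below is a minimal sketch of embedding-based retrieval with cosine similarity. The 768-dimensional embeddings, the database size, and the function name are illustrative assumptions rather than our actual pipeline.

```python
import numpy as np

def retrieve_top_k(query_emb: np.ndarray, db_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k database embeddings most similar to the query.

    query_emb: (d,) embedding of the query image.
    db_embs:   (N, d) embeddings of the candidate images.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q                      # (N,) cosine similarities
    return np.argsort(-sims)[:k]       # indices of the top-k most similar images

# Toy usage with random vectors standing in for ResNet/ViT features.
rng = np.random.default_rng(0)
db_embs = rng.normal(size=(1000, 768))
query_emb = rng.normal(size=768)
print(retrieve_top_k(query_emb, db_embs, k=5))
```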

In a multimodal image retrieval system, the goal remains the same, but the query is a combination of the image and another modality, such as text or image metadata.
In our scenario, we'll be working with images taken from the ISS (International Space Station). The goal of our image retrieval system is to retrieve relevant candidate images for our queries, where "relevant" means there is a non-zero overlap between the query and the retrieved images. Briefly, there are three sources of data:
The nadir is the point on Earth's surface directly below the ISS. It is useful metadata for the ISS images: knowing it gives a rough estimate of where the image was taken, since we can be sure the imaged area lies within roughly 2,500 km of the nadir point. The picture below illustrates this concept.

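To illustrate how strongly the nadir constrains the search area, here is a small sketch that checks whether a candidate's nadir lies within ~2,500 km of the query's nadir using the haversine formula. The function names, the hard radius threshold, and the example coordinates are assumptions for illustration only; our model does not hard-filter by distance but instead learns from the coordinates, as described next.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_nadir_radius(query_nadir, candidate_nadir, radius_km=2500.0):
    """True if the candidate's nadir lies within radius_km of the query's nadir."""
    return haversine_km(*query_nadir, *candidate_nadir) <= radius_km

# Example: approximate Gullfoss coordinates vs. a nadir over the North Atlantic.
print(within_nadir_radius((64.33, -20.12), (55.0, -30.0)))  # True (distance is well under 2,500 km)
```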
The question remains, however: how do we use the metadata to improve retrieval performance? We opted for a simple approach, partially inspired by the recent CVPR paper by Dhakal et al. [1] and the GeoCLIP paper [2]. We use a pretrained DINOv2 foundation model [3] as the image encoder, combined with a SALAD aggregator [4] to produce stronger global embeddings. For the metadata encoder, we pass the nadir coordinates (latitude, longitude) through a combination of Fourier features and standard MLP layers to obtain a high-dimensional embedding. Once we have both the image and metadata embeddings, we simply concatenate them and use the result as our final multimodal embedding. We train with a multi-similarity loss on the concatenated embeddings. The picture below illustrates the architecture.

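Below is a minimal PyTorch sketch of the metadata branch and the concatenation step. The Fourier-feature frequencies, MLP widths, output dimensions, and the 8448-dimensional random stand-in for the DINOv2 + SALAD image embedding are illustrative assumptions, not our exact configuration.

```python
import math
import torch
import torch.nn as nn

class FourierCoordEncoder(nn.Module):
    """Encode nadir (latitude, longitude) into a high-dimensional embedding."""

    def __init__(self, num_frequencies: int = 16, hidden_dim: int = 256, out_dim: int = 256):
        super().__init__()
        # Log-spaced frequencies for the Fourier feature mapping (assumed choice).
        self.register_buffer("freqs", 2.0 ** torch.arange(num_frequencies) * math.pi)
        fourier_dim = 2 * 2 * num_frequencies  # (lat, lon) x (sin, cos) x frequencies
        self.mlp = nn.Sequential(
            nn.Linear(fourier_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, latlon_deg: torch.Tensor) -> torch.Tensor:
        # Normalise degrees to [-1, 1] before the Fourier mapping.
        coords = latlon_deg / torch.tensor([90.0, 180.0], device=latlon_deg.device)
        proj = coords.unsqueeze(-1) * self.freqs                        # (B, 2, F)
        feats = torch.cat([proj.sin(), proj.cos()], dim=-1).flatten(1)  # (B, 4F)
        return self.mlp(feats)

def multimodal_embedding(image_emb: torch.Tensor, latlon_deg: torch.Tensor,
                         coord_encoder: FourierCoordEncoder) -> torch.Tensor:
    """Concatenate the image embedding (e.g. from DINOv2 + SALAD) with the metadata embedding."""
    meta_emb = coord_encoder(latlon_deg)
    return torch.cat([image_emb, meta_emb], dim=-1)

# Toy usage: random tensors stand in for SALAD-aggregated image embeddings.
coord_encoder = FourierCoordEncoder()
image_emb = torch.randn(4, 8448)          # assumed image descriptor size, for illustration
latlon = torch.tensor([[64.3, -20.1], [30.5, -97.7], [22.6, 31.3], [-3.1, -60.0]])
emb = multimodal_embedding(image_emb, latlon, coord_encoder)
print(emb.shape)  # torch.Size([4, 8704])
```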
We performed two experiments to test whether multimodal embeddings improve retrieval performance. In the first experiment, no metadata is used: only the image embeddings are used for retrieval. In the second experiment, we use the image embeddings concatenated with the metadata embeddings for both the query and candidate images.
The results are reported on each of the evaluation datasets (Alps, Texas, Toshka Lakes, Amazon, Napa, Gobi) using the standard Recall@K metric. Note that all 161,496 candidate images are searched for every query, which makes the task more challenging. Below is the coverage of the query and candidate images used for retrieval.

Results with multimodal embeddings (image + nadir metadata):

| Dataset | Number of Queries | Number of Database Images | R@1 | R@5 | R@100 |
|---|---|---|---|---|---|
| Alps | 2394 | 161496 | 80.8 | 89.7 | 97.3 |
| Texas | 6143 | 161496 | 82.1 | 89.0 | 96.1 |
| Toshka Lakes | 2164 | 161496 | 90.6 | 95.1 | 98.7 |
| Amazon | 682 | 161496 | 77.9 | 85.5 | 93.5 |
| Napa | 3569 | 161496 | 82.5 | 88.9 | 96.4 |
| Gobi | 726 | 161496 | 74.4 | 87.1 | 97.0 |

Results with image embeddings only:

| Dataset | Number of Queries | Number of Database Images | R@1 | R@5 | R@100 |
|---|---|---|---|---|---|
| Alps | 2394 | 161496 | 78.8 | 87.2 | 96.4 |
| Texas | 6143 | 161496 | 80.1 | 86.8 | 95.1 |
| Toshka Lakes | 2164 | 161496 | 90.1 | 94.3 | 98.4 |
| Amazon | 682 | 161496 | 73.5 | 82.3 | 91.6 |
| Napa | 3569 | 161496 | 80.6 | 87.4 | 95.2 |
| Gobi | 726 | 161496 | 74.7 | 85.3 | 96.0 |
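
For reference, here is a minimal sketch of how the Recall@K numbers above can be computed, assuming that for each query we have the ranked database indices and the set of database images with non-zero overlap; the function and variable names are illustrative.

```python
import numpy as np

def recall_at_k(retrieved: np.ndarray, relevant: list[set[int]], k: int) -> float:
    """Fraction of queries for which at least one relevant image appears in the top-k.

    retrieved: (num_queries, max_k) database indices sorted by similarity.
    relevant:  per-query set of database indices that overlap the query.
    """
    hits = sum(
        1 for ranked, rel in zip(retrieved, relevant)
        if rel and not rel.isdisjoint(ranked[:k])
    )
    return hits / len(relevant)

# Toy usage: two queries, top-5 retrieved indices each.
retrieved = np.array([[10, 3, 7, 42, 5],
                      [8, 6, 1, 2, 9]])
relevant = [{42, 100}, {0, 4}]
print(recall_at_k(retrieved, relevant, k=5))  # 0.5
```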
On average, we observed an increase of roughly 2 percentage points in Recall@K when using multimodal embeddings. The largest gain was on the Amazon dataset. This may be because the nadir coordinates provide especially useful context there: the Amazon rainforest covers a vast area, and ISS images of it often lack distinctive features, making them hard to retrieve with image embeddings alone.
We believe that image retrieval can see significant performance improvements from multimodal embeddings. This article should be seen as a proof of concept; there are many ways to improve performance further.
[1] Dhakal, A., Sastry, S., Khanal, S., Ahmad, A., Xing, E., & Jacobs, N. (2025). RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings. arXiv preprint arXiv:2502.19781.
[2] Vivanco, V., Nayak, G. K., & Shah, M. (2023). GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization. In Advances in Neural Information Processing Systems.
[3] Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.-Y., Xu, H., Sharma, V., Li, S.-W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., & Bojanowski, P. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193.
[4] Izquierdo, S., & Civera, J. (2024). Optimal Transport Aggregation for Visual Place Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
All models were trained on a single NVIDIA H100 GPU. We used the DINOv2 model with register tokens and the SALAD aggregator. The learning rate was set to 1e-4, and the models were trained for 20 epochs with a batch size of 96. For more details, consult the GitHub repository here.
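
As a rough sketch of how the training loop fits together, the snippet below pairs the concatenated embeddings with the multi-similarity loss from the pytorch-metric-learning library, using an AdamW optimizer (an assumed choice) at the stated learning rate of 1e-4. The tiny stand-in model, the random data, and the loss hyperparameters are assumptions for illustration, not taken from our actual training code.

```python
import torch
import torch.nn as nn
from pytorch_metric_learning import losses

# Stand-ins for the real pipeline: a tiny model and random inputs, purely for illustration.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
inputs = torch.randn(96, 128)                 # batch of 96, matching our batch size
labels = torch.randint(0, 10, (96,))          # place labels; same label = overlapping images

criterion = losses.MultiSimilarityLoss(alpha=2, beta=50, base=0.5)  # library defaults
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(3):                         # the real runs train for 20 epochs
    embeddings = model(inputs)                # multimodal embeddings in the real pipeline
    loss = criterion(embeddings, labels)      # pulls same-place embeddings together, pushes others apart
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss = {loss.item():.4f}")
```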