Runmin Zhang^{∗1}, Jun Ma^{∗2,1}, Si-Yuan Cao^{∗†2,1}, Lun Luo^{3}, Beinan Yu^{1}, Shu-Jie Chen^{4}, Junwei Li^{1}, Hui-Liang Shen^{1}

^{1}College of Information Science and Electronic Engineering, Zhejiang University
^{2}Ningbo Innovation Center, Zhejiang University
^{3}HAOMO.AI Technology Co., Ltd.
^{4}Zhejiang GongShang University

^{∗}Equal contributions. ^{†}Corresponding author (cao_siyuan@zju.edu.cn).

###### Abstract

We propose a novel unsupervised cross-modal homography estimation framework based on intra-modal Self-supervised learning, Correlation, and consistent feature map Projection, namely SCPNet. The concept of intra-modal self-supervised learning is first presented to facilitate unsupervised cross-modal homography estimation. The correlation-based homography estimation network and the consistent feature map projection are combined to form the learnable architecture of SCPNet, boosting the unsupervised learning framework. SCPNet is the first to achieve effective unsupervised homography estimation on the satellite-map image pair cross-modal dataset, GoogleMap, under [-32,+32] offset on a $128\times 128$ image, outperforming the supervised approach MHN by 14.0% in mean average corner error (MACE). We further conduct extensive experiments on several cross-modal/spectral and manually-made inconsistent datasets, on which SCPNet achieves state-of-the-art (SOTA) performance among unsupervised approaches, with 49.0%, 25.2%, 36.4%, and 10.7% lower MACEs than the supervised approach MHN. Source code is available at https://github.com/RM-Zhang/SCPNet.

###### Keywords:

Homography estimation, Unsupervised learning, Multi-modal and multi-spectral images

## 1 Introduction

Homography estimation aims to compute the global perspective transform between images. Existing supervised homography estimation approaches [12, 28, 45, 37, 6, 8] can usually handle the homography estimation task under large offsets and modality gaps. However, in real applications, the homography deformation between images is usually unknown, especially for cross-modal images captured by different devices or at various times [40, 11]. Therefore, unsupervised cross-modal homography estimation is vital for real-world tasks such as multi-spectral image fusion [43, 46], multi-modal image restoration [33, 13], and GPS-denied navigation [19, 45].

For the above reasons, unsupervised deep homography estimation has attracted growing interest. Nguyen *et al*. [35] trained a deep homography estimation network in an unsupervised manner by comparing the pixel intensities of the warped source image and the target image. Wang *et al*. [38] constrained the intensity loss in a cyclic manner, which further improves the homography estimation accuracy. To cope with illumination changes, several works [44, 41] achieve unsupervised homography estimation by introducing feature representations for both homography estimation and consistency supervision. Based on the above two works, Hong *et al*. [23] adopted the GAN [20] model to improve the supervision of feature similarity. However, most of the above approaches focus only on cross-modal intensity-based learning, and can only separately address either large offsets or modality gaps [27].

To cope with the above problem, in this paper we propose a novel unsupervised cross-modal homography estimation framework, namely SCPNet, which adopts intra-modal Self-supervised learning, Correlation, and consistent feature map Projection. As illustrated in Fig. 1, different from previous unsupervised works that only adopt cross-modal intensity-based learning, SCPNet introduces intra-modal self-supervised learning as extra supervision and has a special architecture based on correlation and consistent feature map projection. It is observed that SCPNet achieves successful unsupervised homography estimation on cross-modal data under such large offsets, while the others cannot.

The intra-modal self-supervised learning lays the foundation of our SCPNet, mining two-branch self-supervised information by applying simulated homographies within the two modalities. The network with shared weights is trained simultaneously by the two-branch self-supervised learning. According to the ablation on GoogleMap, using only intra-modal self-supervised learning, our SCPNet already converges during training, even without the cross-modal intensity-based learning in [35, 38, 44, 41]. On the contrary, using only cross-modal intensity-based learning fails to converge under such a large modality gap and offset. The two learning strategies are combined to form the final supervision of SCPNet. The correlation and consistent feature map projection have been separately employed in many previous homography estimation frameworks [37, 6, 44, 41, 23]. However, the strategy and effectiveness of combining them to form an effective unsupervised cross-modal homography estimation framework have not been investigated. The above two parts form the powerful learnable architecture of our SCPNet, which also boosts the unsupervised training framework.

To the best of our knowledge, SCPNet is the first method that achieves effective unsupervised homography estimation under such a large offset (offset range of [-32,+32] on a $128\times 128$ image) and modality gap (GoogleMap [45] of satellite-map image pairs as in Fig. 1), outperforming the supervised approach MHN [28] by 14.0% in mean average corner error (MACE). We further evaluate our SCPNet on the Flash/no-flash [21] cross-modal dataset, the Harvard [10] and RGB/NIR [5] cross-spectral datasets, and the PDS-COCO [27] manually-made inconsistent dataset, on which it also achieves state-of-the-art (SOTA) performance among unsupervised approaches. In summary, our contributions are as follows:

- •
We propose SCPNet, a novel unsupervised cross-modal homography estimation framework, which combines three key components: intra-modal self-supervised learning, correlation, and consistent feature map projection. SCPNet ranks top in unsupervised homography estimation on cross-modal/spectral and manually-made inconsistent data under large offsets.

- •
The concept of intra-modal self-supervised learning is devised to support the unsupervised learning framework, mining two-branch self-supervised information by applying simulated homographies within the two modalities. By simultaneously training the weight-shared network with the two-branch self-supervised learning, the homography estimation knowledge can be generalized from intra-modal to cross-modal.

- •
We combine the correlation and consistent feature map projection to form the powerful unsupervised learning network architecture of SCPNet. The correlation constrains the network to learn clearer knowledge that can be generalized from intra-modal to cross-modal. The projected consistent feature map supervises both the cross-modal homography estimation and the cross-modal consistent latent space projection, which further improves the estimation accuracy.

## 2 Related Work

Traditional Approaches. The most widely used traditional homography estimation approaches, namely feature-based approaches, typically involve three key steps: feature extraction, feature matching, and homography estimation [37]. Commonly used feature extraction approaches include SIFT [32], SURF [4], and ORB [34]. Popular homography estimation techniques include DLT [16], RANSAC [17], IRLS [22], and MAGSAC [3]. To further improve the robustness of cross-modal feature extraction, approaches such as LGHD [1], RIFT [29], and DASC [26] have been presented. The above approaches achieve reliable homography estimation under moderate intensity variance and deformation, but may produce unsatisfactory results when dealing with cross-modal images under large offsets [7, 6, 28, 37].

Supervised Approaches. DeTone *et al*. [12] first introduced the end-to-end homography estimation network DHN. To further improve the accuracy of homography estimation, many approaches have been subsequently presented. For example, MHN [28] used multi-scale network concatenation and DLKFM [45] adopted deep Lucas-Kanade iteration. Furthermore, LocalTrans [37] trained a multi-scale local transformer, IHN [6] employed deep learnable iteration, and RHWF [8] combined homography-guided image warping and a focus transformer, *etc*. However, obtaining the ground-truth is often difficult and costly, making it challenging for supervised learning approaches to be widely applicable in practice.

Unsupervised Approaches. Nguyen *et al*. [35] trained a homography estimation network using a pixel-level photometric loss in an unsupervised manner. Based on this pioneering work, Wang *et al*. [38] added extra supervision via invertibility constraints. Zhang *et al*. [44] presented CA-UDHN to measure similarity in feature space instead of pixel space. However, CA-UDHN has poor robustness for images with large viewpoint changes [27]. Koguciuk *et al*. [27] then extended CA-UDHN with a perceptual loss [25], which improves the robustness of unsupervised deep homography estimation under large intensity and viewpoint changes. Furthermore, Ye *et al*. [41] introduced a feature identity loss to enforce the image features to be warp-equivalent, and proposed a homography flow representation. Besides the above approaches, some unsupervised techniques such as NeMAR [2], UMF-CMGR [14], and RFNet [39] use modality transfer networks to translate one modality into another, achieving unsupervised cross-modal/spectral motion estimation.

## 3 Pilot Experiments and Finding

We first denote the image pair from modalities A and B as $\mathbf{I}_{\mathrm{A}}$ and $\mathbf{I}_{\mathrm{B}}$, with a homography deformation between them. To train a homography estimation network in a supervised manner, the objective of network training can be formulated as

$\mathop{\arg\min}_{\theta}\mathcal{L}_{\mathrm{S}}\big(\phi_{\theta}(\mathbf{I}_{\mathrm{A}},\mathbf{I}_{\mathrm{B}}),\mathbf{H}_{\mathrm{GT}}\big),$ (1)

where $\mathbf{H}_{\mathrm{GT}}$ denotes the ground-truth homography between the two images, $\phi_{\theta}$ the network, and $\theta$ the network parameters to be optimized. $\mathcal{L}_{\mathrm{S}}$ denotes the supervised loss, which is usually the $L_{2}$ [28] or $L_{1}$ [37, 6] norm. However, in practical applications, the ground-truth homography is generally difficult to obtain, especially for cross-modal images captured by different devices or at various times [40, 11]. To cope with this difficulty, unsupervised homography estimation has been investigated, and the training of most such methods [35, 44, 27, 23] can be modeled as

$\mathop{\arg\min}_{\theta}\mathcal{L}_{\mathrm{C}}\big(\mathbf{I}_{\mathrm{A}},\mathcal{W}(\mathbf{I}_{\mathrm{B}},\phi_{\theta}(\mathbf{I}_{\mathrm{A}},\mathbf{I}_{\mathrm{B}}))\big),$ (2)

where $\mathcal{W}$ denotes the warping operation using the predicted homography $\phi_{\theta}(\mathbf{I}_{\mathrm{A}},\mathbf{I}_{\mathrm{B}})$, and $\mathcal{L}_{\mathrm{C}}$ denotes the cross-modal loss that monitors the content similarity between the warped $\mathbf{I}_{\mathrm{B}}$ and $\mathbf{I}_{\mathrm{A}}$. The cross-modal intensity-based loss ranges from the $L_{1}$ pixel-wise photometric loss [35] and the $L_{1}$ similarity loss on feature maps [44, 41] to the perceptual loss [27]. Nevertheless, under large homography deformation and intensity variance, the above losses may fail, according to [27] and our experiments. Since the cross-modal image intensity similarity is generally highly non-convex [7], the solution space of the loss function is hard to optimize, and the training process is prone to non-convergence, as demonstrated in [27].

Inspired by multitask learning [9], which simultaneously tackles multiple tasks using a shared representation, we propose intra-modal self-supervised learning to achieve better supervision. Multitask learning implicitly learns task relationships within a shared representation through gradient aggregation [9], and recent studies have shown that various tasks benefit from it [15, 24, 18]. The motivation of our intra-modal self-supervised learning is to enhance the unsupervised learning process by introducing highly related extra tasks that provide direct supervision. While obtaining cross-modal ground-truth homography is challenging, intra-modal ground-truth homography can be easily generated by directly applying simulated deformations [12]. This allows the knowledge of homography transformation to be learned directly, rather than indirectly as in common cross-modal intensity-based learning. Additionally, the relationship between images from the two modalities, such as mutual structures, is likely to be learned within the shared representation during two-branch intra-modal self-supervised learning. To validate the aforementioned statement, we train a weight-shared network to separately predict the homography within the two modalities under direct supervision from simulation, which can be expressed as

$\mathop{\arg\min}_{\theta}\mathcal{L}_{\mathrm{S}}\big(\phi_{\theta}(\mathbf{I}_{\mathrm{A}},\mathbf{I}^{\prime}_{\mathrm{A}}),\mathbf{H}_{\mathrm{GT,A}}\big)+\mathcal{L}_{\mathrm{S}}\big(\phi_{\theta}(\mathbf{I}_{\mathrm{B}},\mathbf{I}^{\prime}_{\mathrm{B}}),\mathbf{H}_{\mathrm{GT,B}}\big),$ (3)

where $\mathbf{I}^{\prime}_{\mathrm{A}}$ denotes $\mathbf{I}_{\mathrm{A}}$ warped with the simulated ground-truth homography $\mathbf{H}_{\mathrm{GT,A}}$, and modality B is handled in the same manner. We conduct this pilot experiment on the cross-modal dataset GoogleMap [45], which exhibits large intensity and content differences, under [-32,+32] offset on a $128\times 128$ image. The cross-modal test MACEs of the networks trained using intra-modal self-supervised learning and common cross-modal intensity-based learning are plotted over the training iterations in Fig. 2. Interestingly, we find that the network trained by intra-modal self-supervised learning has an evidently better cross-modal performance than the one trained by common cross-modal intensity-based learning. Therefore, we obtain the following finding: cross-modal homography estimation can be indirectly facilitated by training the weight-shared network using simulated transforms within the two modalities.
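To make the pilot setup concrete, the sketch below shows one training step of Eq. 3 in PyTorch. The network `net` and the helpers `sample_homography` and `warp` are assumed placeholders rather than the released code; the point is only that the same weight-shared network receives two independently simulated intra-modal pairs.

```python
import torch
import torch.nn.functional as F

def intra_modal_pilot_step(net, img_a, img_b, sample_homography, warp):
    """One step of the pilot experiment (Eq. 3), a minimal sketch.

    net:               weight-shared homography network phi_theta (assumed interface)
    img_a, img_b:      image batches from modalities A and B
    sample_homography: returns random simulated ground-truth parameters (assumed)
    warp:              warps a batch of images with a homography (assumed)
    """
    h_gt_a = sample_homography()          # H_GT,A (e.g., four-corner offsets)
    h_gt_b = sample_homography()          # H_GT,B
    img_a_prime = warp(img_a, h_gt_a)     # I'_A
    img_b_prime = warp(img_b, h_gt_b)     # I'_B
    # The same weight-shared network predicts the simulated homography within each modality.
    loss = F.l1_loss(net(img_a, img_a_prime), h_gt_a) \
         + F.l1_loss(net(img_b, img_b_prime), h_gt_b)
    return loss
```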

## 4 SCPNet

Based on the finding in Section 3, we hope to further design a network architecture and complement it with an appropriate training strategy. For this purpose, we propose the unsupervised cross-modal homography estimation framework that adopts intra-modal Self-supervised learning, Correlation, and consistent feature map Projection, namely SCPNet. Fig. 3(a) shows the schematic diagram of the training and inference framework of SCPNet. Figs. 3(b) and 3(c) show the two learnable modules that form the architecture of SCPNet. Considering that the learnable modules and the training strategy are coupled and mutually promote each other, in the following sections we describe SCPNet by building a powerful unsupervised learning framework upon the finding of intra-modal self-supervised learning.

### 4.1 Correlation-based Homography Estimation Network

Similar to the previous unsupervised network architectures [35, 38, 44, 41, 27], the finding in Section 3 is obtained by concatenating the image pair in the channel dimension and expecting the network to directly predict the homography. The knowledge of homography estimation is thus implicitly learned without any constraint or hint, and hence the potential of our intra-modal self-supervised learning might not be fully explored.

With the above consideration, we instead construct the homography estimation network with correlation. The architecture of the correlation-based homography estimation network is illustrated in Fig. 3(b). The feature extractor with shared weights produces the features of the two modalities, namely $\mathbf{F}_{\mathrm{A}}$ and $\mathbf{F}_{\mathrm{B}}$. The correlation is realized by computing the inner product of $\mathbf{F}_{\mathrm{A}}$ and $\mathbf{F}_{\mathrm{B}}$ within a local area, which can be expressed as

$\mathbf{C}(\mathbf{x},\mathbf{r})=\mathrm{ReLU}\big(\mathbf{F}_{\mathrm{A}}(\mathbf{x})^{\mathsf{T}}\mathbf{F}_{\mathrm{B}}(\mathbf{x}+\mathbf{r})\big),\ \ \ \ \|\mathbf{r}\|_{\infty}\leq R,$ (4)

where $R$ controls the radius of each local area. The correlation is then fed into the homography estimator to conduct the homography prediction. The structural details of the homography estimation network can be found in the supplementary material. By adopting correlation, the homography estimation network is divided into the weight-shared feature extractor, the correlation computation, and the homography estimator. Under the intra-modal self-supervised learning, each of the above parts is constrained to learn clearer knowledge that can be generalized to the cross-modal case: 1) the feature extractor is constrained to produce feature representations that are effective for correlation, which indirectly enforces the sharing of intra-modal self-supervised knowledge with the cross-modal case; 2) the similarity of features is explicitly encoded by the correlation, which is unified across modalities; 3) the knowledge of homography decoding is strictly defined by the correlation input, which is also unified. The intra-modal self-supervised learning of the network can then be formulated as

$\mathop{\arg\min}_{\xi}\mathcal{L}_{\mathrm{S}}\big(\psi_{\xi}(\mathbf{I}_{\mathrm{A}},\mathbf{I}^{\prime}_{\mathrm{A}}),\mathbf{H}_{\mathrm{GT,A}}\big)+\mathcal{L}_{\mathrm{S}}\big(\psi_{\xi}(\mathbf{I}_{\mathrm{B}},\mathbf{I}^{\prime}_{\mathrm{B}}),\mathbf{H}_{\mathrm{GT,B}}\big),$ (5)

where $\psi_{\xi}$ denotes the correlation-based homography estimation network, with $\xi$ denoting the parameters to be optimized.
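A minimal sketch of the local correlation in Eq. 4 is given below, assuming feature maps of shape (B, C, H, W); the explicit loop over displacements and the zero padding are our implementation choices, not necessarily those of the released code.

```python
import torch
import torch.nn.functional as F

def local_correlation(feat_a: torch.Tensor, feat_b: torch.Tensor, radius: int = 4) -> torch.Tensor:
    """Local correlation volume (Eq. 4): ReLU of inner products between F_A at x and
    F_B at x + r for all displacements with ||r||_inf <= radius.
    feat_a, feat_b: (B, C, H, W). Returns (B, (2*radius+1)**2, H, W)."""
    b, c, h, w = feat_a.shape
    # Zero-pad F_B so that every shifted neighbourhood stays in bounds.
    feat_b_pad = F.pad(feat_b, [radius] * 4)
    corr = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = feat_b_pad[:, :, dy:dy + h, dx:dx + w]
            # Inner product over the channel dimension, one value per pixel.
            corr.append((feat_a * shifted).sum(dim=1, keepdim=True))
    return torch.relu(torch.cat(corr, dim=1))
```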

### 4.2 Consistent Feature Map Projector

After introducing intra-modal self-supervised learning, we then consider bringing in valid cross-modal supervision to further improve the estimation accuracy. As discussed in Section 3, directly applying the intensity-based supervision of Eq. 2 to cross-modal images with severe content differences is infeasible. To cope with this problem, we introduce consistent feature map projection to assist the intensity-based cross-modal supervision, which projects the input images from an intensity-variant space to an intensity-invariant latent space. The architecture of the consistent feature map projector is illustrated in Fig. 3(c). The input image is first processed by a convolutional block with kernel size $3\times 3$. The produced feature map is then processed by two residual blocks. Finally, the multi-channel feature map is projected into a single-channel consistent feature map by a $1\times 1$ convolutional block.
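The projector can be sketched as the following PyTorch module; the hidden channel width (64 here) and the exact layer layout inside the residual blocks are assumptions, as the paper only specifies the $3\times 3$ block, two residual blocks, and the final $1\times 1$ projection.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain two-layer residual block used inside the projector sketch."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class ConsistentProjector(nn.Module):
    """Projects an input image into a single-channel consistent feature map:
    a 3x3 convolutional block, two residual blocks, then a 1x1 projection."""
    def __init__(self, in_channels: int = 3, hidden: int = 64):  # hidden width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            ResidualBlock(hidden),
            ResidualBlock(hidden),
            nn.Conv2d(hidden, 1, 1),
        )

    def forward(self, img):
        return self.net(img)
```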

Boosted by the consistent feature map projector, the cross-modal intensity-based training can be expressed as

$\mathop{\arg\min}_{\xi,\zeta}\mathcal{L}_{\mathrm{C}}\big(\delta_{\zeta}(\mathbf{I}_{\mathrm{A}}),\mathcal{W}(\delta_{\zeta}(\mathbf{I}_{\mathrm{B}}),\psi_{\xi}(\delta_{\zeta}(\mathbf{I}_{\mathrm{A}}),\delta_{\zeta}(\mathbf{I}_{\mathrm{B}})))\big),$ (6)

where $\delta_{\zeta}$ denotes the consistent feature map projector, with $\zeta$ denoting its learnable parameters. We note that, with the consistent feature map projector, the cross-modal intensity-based learning not only supervises the cross-modal homography estimation but also makes the projected feature maps as similar as possible, which will further boost the estimation accuracy.

### 4.3 Training/Inference Framework

We have now separately introduced the intra-modal self-supervised learning, the correlation-based homography estimation network, and the consistent feature map projector with cross-modal intensity-based learning. The complete framework of SCPNet is determined by combining the above learning strategies and modules. The training framework of SCPNet contains two self-supervised learning branches and one cross-modal learning branch, which is the most significant difference compared to previous approaches. The three branches simultaneously apply supervision to the weight-shared learnable modules as demonstrated in Fig. 3(a). For better illustration, the projected consistent feature maps are denoted by $\mathbf{P}_{\mathrm{A}}=\delta_{\zeta}(\mathbf{I}_{\mathrm{A}})$ and $\mathbf{P}_{\mathrm{B}}=\delta_{\zeta}(\mathbf{I}_{\mathrm{B}})$, and the warped $\mathbf{P}_{\mathrm{B}}$ by $\mathbf{P}_{\mathrm{B,W}}=\mathcal{W}(\delta_{\zeta}(\mathbf{I}_{\mathrm{B}}),\psi_{\xi}(\delta_{\zeta}(\mathbf{I}_{\mathrm{A}}),\delta_{\zeta}(\mathbf{I}_{\mathrm{B}})))$. The predicted cross-modal homography is denoted by $\mathbf{H}_{\mathrm{AB}}=\psi_{\xi}(\delta_{\zeta}(\mathbf{I}_{\mathrm{A}}),\delta_{\zeta}(\mathbf{I}_{\mathrm{B}}))$, and the intra-modal ones by $\mathbf{H}_{\mathrm{A}}=\psi_{\xi}(\delta_{\zeta}(\mathbf{I}_{\mathrm{A}}),\delta_{\zeta}(\mathbf{I}^{\prime}_{\mathrm{A}}))$ and $\mathbf{H}_{\mathrm{B}}=\psi_{\xi}(\delta_{\zeta}(\mathbf{I}_{\mathrm{B}}),\delta_{\zeta}(\mathbf{I}^{\prime}_{\mathrm{B}}))$. As mentioned in Section 3, for the two self-supervised branches, the inputs $\mathbf{I}_{\mathrm{A}}$ and $\mathbf{I}_{\mathrm{B}}$ are separately deformed and trained under the direct supervision of the simulated homographies $\mathbf{H}_{\mathrm{GT,A}}$ and $\mathbf{H}_{\mathrm{GT,B}}$. Meanwhile, cross-modal intensity-based learning is conducted by applying supervision to the projected consistent feature map $\mathbf{P}_{\mathrm{A}}$ and the warped one $\mathbf{P}_{\mathrm{B,W}}$. The correlation-based homography estimation network and the consistent feature map projector are both incorporated to form the network. Finally, the entire unsupervised cross-modal learning framework can be formulated as

$\mathop{\arg\min}_{\xi,\zeta}\ \mathcal{L}_{\mathrm{C}}\big(\delta_{\zeta}(\mathbf{I}_{\mathrm{A}}),\mathcal{W}(\delta_{\zeta}(\mathbf{I}_{\mathrm{B}}),\psi_{\xi}(\delta_{\zeta}(\mathbf{I}_{\mathrm{A}}),\delta_{\zeta}(\mathbf{I}_{\mathrm{B}})))\big)+\lambda\mathcal{L}_{\mathrm{S}}\big(\psi_{\xi}(\delta_{\zeta}(\mathbf{I}_{\mathrm{A}}),\delta_{\zeta}(\mathbf{I}^{\prime}_{\mathrm{A}})),\mathbf{H}_{\mathrm{GT,A}}\big)+\lambda\mathcal{L}_{\mathrm{S}}\big(\psi_{\xi}(\delta_{\zeta}(\mathbf{I}_{\mathrm{B}}),\delta_{\zeta}(\mathbf{I}^{\prime}_{\mathrm{B}})),\mathbf{H}_{\mathrm{GT,B}}\big).$ (7)

We note that, once combined, the correlation also encourages the projected consistent feature maps to have clear contents under the promotion of cross-modal intensity-based learning, which will be discussed in Section 5.2. In the inference phase, only the cross-modal prediction branch of SCPNet is used.
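A compact sketch of one training step implementing Eq. 7 is shown below. The homography network `net`, the projector, the warping operator, and the loss functions are assumed interfaces (the concrete losses follow Eqs. 8 and 9 in Section 4.4); `net` is assumed to output four-corner offsets, matching the parameterization used for supervision.

```python
def scpnet_training_loss(projector, net, warp, loss_cross, loss_self,
                         img_a, img_b, img_a_prime, img_b_prime,
                         off_gt_a, off_gt_b, lam=0.1):
    """Three-branch objective of Eq. 7 (sketch with assumed interfaces).

    img_a_prime / img_b_prime: intra-modal images warped by the simulated homographies;
    off_gt_a / off_gt_b: the corresponding ground-truth four-corner offsets.
    """
    # Project all inputs into the consistent (intensity-invariant) space.
    p_a, p_b = projector(img_a), projector(img_b)
    p_a_prime, p_b_prime = projector(img_a_prime), projector(img_b_prime)

    # Cross-modal branch: predict H_AB, warp P_B, and compare with P_A (Eq. 9).
    h_ab = net(p_a, p_b)
    p_b_warp = warp(p_b, h_ab)
    l_c = loss_cross(p_a, p_b_warp, p_b)

    # Two intra-modal self-supervised branches with simulated ground truth (Eq. 8).
    l_s = loss_self(net(p_a, p_a_prime), off_gt_a) + \
          loss_self(net(p_b, p_b_prime), off_gt_b)
    return l_c + lam * l_s
```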

### 4.4 Loss Function and Implementation Details

For the intra-modal self-supervised loss, we parameterize the homography matrix by the offsets of the four corner points to stabilize training [12, 6, 8]. We use the $L_{1}$ norm on the differences between the predicted offsets $\mathbf{O}\in\mathbb{R}^{2\times 2\times 2}$ and the ground-truth offsets $\mathbf{O}_{\mathrm{GT}}\in\mathbb{R}^{2\times 2\times 2}$, which can be formulated as

$\mathcal{L}_{\mathrm{S}}=\|\mathbf{O}-\mathbf{O}_{\mathrm{GT}}\|_{1}.$ (8)

The homography parameterization using the offsets of four corner points can be found in the supplementary material.

We set the cross-modal intensity-based loss as follows:

$\mathcal{L}_{\mathrm{C}}=\frac{\|\mathbf{P}_{\mathrm{A}}-\mathbf{P}_{\mathrm{B,W}}\|_{1}}{\|\mathbf{P}_{\mathrm{A}}-\mathbf{P}_{\mathrm{B}}\|_{1}},$ (9)

where the numerator minimizes the differences between the consistent feature map $\mathbf{P}_{\mathrm{A}}$ and the warped one $\mathbf{P}_{\mathrm{B,W}}$, while the denominator maximizes the differences between $\mathbf{P}_{\mathrm{A}}$ and $\mathbf{P}_{\mathrm{B}}$, which prevents invalid feature map projection.

We set $\lambda=0.1$ in Eq. 7 during training. We use the AdamW [31] optimizer with a maximum learning rate of $4\times 10^{-4}$ for network training. The batch size is set to $8$, with a total of $120000$ training iterations.
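The two losses can be written compactly as below; the mean reduction of the $L_{1}$ norms and the small epsilon in the denominator are our assumptions for numerical convenience and are not stated above.

```python
import torch

def loss_self(pred_offsets: torch.Tensor, gt_offsets: torch.Tensor) -> torch.Tensor:
    """Intra-modal self-supervised loss (Eq. 8): L1 distance between predicted
    and ground-truth four-corner offsets of shape (B, 2, 2, 2)."""
    return (pred_offsets - gt_offsets).abs().mean()

def loss_cross(p_a: torch.Tensor, p_b_warp: torch.Tensor, p_b: torch.Tensor,
               eps: float = 1e-6) -> torch.Tensor:
    """Cross-modal intensity loss (Eq. 9): the numerator pulls the warped consistent
    feature map towards P_A, while the denominator pushes the unwarped P_B away,
    preventing a trivial (constant) projection."""
    return (p_a - p_b_warp).abs().mean() / ((p_a - p_b).abs().mean() + eps)
```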

## 5 Experiments

### 5.1 Datasets and Experimental Settings

Datasets. We evaluate our SCPNet on cross-modal datasets including GoogleMap [45] and Flash/no-flash [21], cross-spectral datasets including Harvard [10] and RGB/NIR [5], together with the manually-made inconsistent dataset PDS-COCO [27]. The GoogleMap dataset contains satellite images and the corresponding map images, which can be used for navigation and geolocation. We use the same training and test data splitting as in [45]. The Flash/no-flash dataset contains 120 pairs of images captured with and without flash. We randomly select 60 image pairs for training and 60 for testing. For multi-spectral data, the Harvard dataset contains 77 real-world image scenes, each containing 31 band images. We take the 16th band image of each scene as the reference image and form a cross-spectral image pair with the image of each remaining band. The training and test sets are split by scene, containing 1170 and 1140 image pairs respectively. For the RGB/NIR dataset, we use 103 pairs of images for training and 153 pairs for testing. PDS-COCO applies random combined perturbations of brightness, contrast, saturation, and hue to the MS-COCO dataset [30]. We use the same training and test splitting as the MS-COCO dataset.

Experimental Settings. The homography deformation is generated in the same way as [8, 6, 12, 45], which randomly perturbs the four corner points of a $128\times 128$ image. Unless otherwise stated, the perturbation range is set to [-32, +32]. We adopt the mean average corner error (MACE) for homography accuracy evaluation; lower MACE indicates higher accuracy.
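For reference, a sketch of the synthetic deformation and the MACE metric is given below, using OpenCV for the perspective transform; the exact sampling and cropping details of [8, 6, 12, 45] may differ.

```python
import numpy as np
import cv2

def random_corner_homography(size=128, rho=32, rng=None):
    """Sample a homography by perturbing the four corners of a size x size patch
    within [-rho, +rho], as in the experimental setting."""
    if rng is None:
        rng = np.random.default_rng()
    src = np.float32([[0, 0], [size - 1, 0], [size - 1, size - 1], [0, size - 1]])
    dst = src + rng.uniform(-rho, rho, size=(4, 2)).astype(np.float32)
    return cv2.getPerspectiveTransform(src, dst), src, dst

def mace(pred_corners, gt_corners):
    """Mean average corner error: average Euclidean distance (in pixels) between
    predicted and ground-truth corner positions."""
    diff = np.asarray(pred_corners) - np.asarray(gt_corners)
    return float(np.linalg.norm(diff, axis=-1).mean())
```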

Comparison Approaches. We evaluate SCPNet on the cross-modal and cross-spectral datasets against handcrafted approaches including SIFT [32], ORB [36], DASC [26], and RIFT [29], unsupervised approaches including UDHN [35], CA-UDHN [44], biHomE [27], BasesHomo [41], and UMF-CMGR [14], and supervised approaches including DHN [12], MHN [28], LocalTrans [37], IHN [6], and RHWF [8]. For SIFT, ORB, DASC, and UMF-CMGR, we choose RANSAC [17] as the homography estimation and outlier rejection algorithm. In addition, UMF-CMGR is a registration-based image fusion approach, and we use its registration network for comparison. We also tried to compare with the unsupervised approaches MU-Net [42] and NeMAR [2], but according to our experiments, neither of them achieves successful homography estimation. To make the comparison more comprehensive, we also evaluate our SCPNet on PDS-COCO [27].

### 5.2 Ablation

Ablation Study on GoogleMap Dataset. We conduct extensive ablation studies on the architecture and supervision of our SCPNet by evaluating the mean average corner error (MACE), as shown in Table 1. Using only cross-modal intensity-based learning for supervision leads to non-convergence or unsatisfactory results (Settings 1–4). In contrast, our intra-modal self-supervised learning achieves superior performance (Settings 5–9). Moreover, the results of SCPNet show gradual improvement as additional ablation components are incorporated.

Table 1: Ablation study on the GoogleMap dataset. 'NC' denotes non-convergence.

| Setting | Self | Correlation | Projection | Cross | MACE$\downarrow$ |
| --- | --- | --- | --- | --- | --- |
| 1 | ✗ | ✗ | ✗ | ✓ | NC |
| 2 | ✗ | ✓ | ✗ | ✓ | NC |
| 3 | ✗ | ✗ | ✓ | ✓ | 24.64 |
| 4 | ✗ | ✓ | ✓ | ✓ | 24.80 |
| 5 | ✓ | ✗ | ✗ | ✗ | 13.06 |
| 6 | ✓ | ✓ | ✗ | ✗ | 9.68 |
| 7 | ✓ | ✗ | ✓ | ✗ | 10.01 |
| 8 | ✓ | ✓ | ✓ | ✗ | 7.70 |
| 9 | ✓ | ✓ | ✓ | ✓ | 4.35 |

The Effectiveness of Correlation on Consistent Feature Map Projection. We further show the consistent feature maps produced by the concatenation and correlation architectures in Fig. 4. It is observed that the correlation visibly facilitates consistent feature map generation by directly constraining the input feature maps through the inner product. Therefore, cross-modal intensity-based learning can be boosted by the high-quality feature maps.

Table 2: Quantitative comparison (MACE) on the cross-modal datasets GoogleMap and Flash/no-flash. 'NC' denotes non-convergence.

| Category | Method | GoogleMap: Easy | Moderate | Hard | Mean | Flash/no-flash: Easy | Moderate | Hard | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Handcrafted | SIFT [32] | 19.17 | 23.87 | 29.04 | 24.53 | 14.61 | 18.69 | 23.85 | 19.53 |
|  | ORB [36] | 19.11 | 23.9 | 29.02 | 24.52 | 16.91 | 22.44 | 27.01 | 22.63 |
|  | DASC [26] | 14.29 | 20.73 | 28.12 | 21.76 | 11.64 | 19.50 | 28.11 | 20.59 |
|  | RIFT [29] | 10.43 | 15.46 | 21.93 | 16.55 | 11.22 | 13.95 | 21.66 | 16.21 |
| Unsupervised | UDHN [35] | 18.63 | 21.55 | 26.89 | 22.84 | 16.27 | 21.27 | 24.85 | 21.20 |
|  | CA-UDHN [44] | 19.31 | 23.92 | 29.10 | 24.61 | 16.01 | 21.54 | 25.14 | 21.32 |
|  | biHomE [27] | NC | NC | NC | NC | 8.24 | 12.56 | 14.04 | 11.86 |
|  | BasesHomo [41] | 19.43 | 23.97 | 28.66 | 24.49 | 19.45 | 24.73 | 29.66 | 25.12 |
|  | UMF-CMGR [14] | 19.22 | 24.01 | 29.02 | 24.60 | 17.99 | 22.43 | 28.40 | 23.49 |
|  | SCPNet (Ours) | 3.60 | 4.44 | 4.85 | 4.35 | 1.80 | 2.33 | 3.59 | 2.67 |
| Supervised | DHN [12] | 7.06 | 6.82 | 7.00 | 6.93 | 5.28 | 6.13 | 7.51 | 6.42 |
|  | MHN [28] | 4.75 | 5.00 | 5.34 | 5.06 | 3.18 | 6.55 | 5.81 | 5.24 |
|  | LocalTrans [37] | 0.91 | 1.43 | 6.30 | 3.22 | 0.49 | 0.67 | 4.05 | 1.96 |
|  | IHN [6] | 0.70 | 0.96 | 1.06 | 0.92 | 0.76 | 0.65 | 0.94 | 0.80 |
|  | RHWF [8] | 0.62 | 0.68 | 0.93 | 0.76 | 0.79 | 0.68 | 0.53 | 0.65 |

Table 3: Quantitative comparison (MACE) on the cross-spectral datasets Harvard and RGB/NIR. 'NC' denotes non-convergence.

| Category | Method | Harvard: Easy | Moderate | Hard | Mean | RGB/NIR: Easy | Moderate | Hard | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Handcrafted | SIFT [32] | 17.27 | 21.49 | 26.70 | 22.30 | 15.54 | 23.90 | 28.81 | 24.40 |
|  | ORB [36] | 18.61 | 23.06 | 28.29 | 23.82 | 17.75 | 22.84 | 27.01 | 23.00 |
|  | DASC [26] | 11.85 | 18.29 | 25.03 | 19.05 | 13.50 | 17.91 | 25.73 | 19.78 |
|  | RIFT [29] | 10.41 | 15.69 | 21.98 | 16.62 | 11.22 | 13.80 | 23.34 | 16.84 |
| Unsupervised | UDHN [35] | 18.03 | 22.20 | 26.55 | 22.69 | 18.54 | 23.27 | 27.16 | 23.43 |
|  | CA-UDHN [44] | 18.77 | 23.64 | 28.55 | 24.14 | 18.31 | 23.88 | 28.66 | 24.12 |
|  | biHomE [27] | NC | NC | NC | NC | 18.61 | 23.05 | 28.18 | 23.77 |
|  | BasesHomo [41] | 19.77 | 24.20 | 28.46 | 24.57 | 19.23 | 23.44 | 28.89 | 24.41 |
|  | UMF-CMGR [14] | 16.61 | 21.08 | 26.25 | 21.81 | 17.04 | 22.16 | 26.53 | 22.38 |
|  | SCPNet (Ours) | 2.34 | 3.70 | 5.48 | 4.00 | 1.65 | 4.69 | 7.13 | 4.78 |
| Supervised | DHN [12] | 5.30 | 6.34 | 8.09 | 6.72 | 9.55 | 10.08 | 14.87 | 11.88 |
|  | MHN [28] | 4.37 | 5.09 | 6.27 | 5.35 | 6.88 | 7.10 | 8.26 | 7.51 |
|  | LocalTrans [37] | 0.27 | 0.43 | 4.58 | 2.04 | 0.53 | 0.77 | 5.15 | 2.47 |
|  | IHN [6] | 1.40 | 1.72 | 2.03 | 1.75 | 1.25 | 2.14 | 2.21 | 1.90 |
|  | RHWF [8] | 1.37 | 1.76 | 1.85 | 1.68 | 0.68 | 1.44 | 1.08 | 1.07 |

### 5.3 Evaluation on Cross-modal/spectral Datasets

We divide the testing image pairs into three levels by the magnitude of the ground-truth offsets, defining $0\sim 30\%$ as ‘Easy’, $30\sim 60\%$ as ‘Moderate’, and $60\sim 100\%$ as ‘Hard’. Table 2 shows the quantitative comparison on the cross-modal datasets. On GoogleMap, homography estimation faces greater challenges due to the large modality differences between image pairs. The handcrafted and other unsupervised approaches produce unsatisfactory results even at the Easy level. On the contrary, our SCPNet produces stable and accurate homography estimation, achieving 37.2% and 14.0% lower MACEs than the supervised approaches DHN and MHN. On Flash/no-flash, SCPNet also provides the best performance among all handcrafted and unsupervised approaches, and is superior to the supervised DHN and MHN. However, our SCPNet does not surpass the supervised approaches LocalTrans, IHN, and RHWF, due to the difference in supervision and their architectures that combine iterative and multi-scale refinement.

Table 3 lists the results on the cross-spectral datasets. It is observed that our SCPNet consistently surpasses the other handcrafted and unsupervised approaches on the Harvard and RGB/NIR datasets. On Harvard, SCPNet outperforms DHN and MHN by 40.5% and 25.2%, respectively. We note that on the Harvard dataset the images exhibit intensity and gradient variations caused by the alternation of 31 spectral bands, but the training strategy of our SCPNet still works robustly. On the RGB/NIR dataset, SCPNet likewise outperforms part of the supervised methods, together with the unsupervised and handcrafted ones, as on the other datasets.

Fig. 5 visualizes the homography estimation results of SCPNet and the comparison approaches on the GoogleMap, Flash/no-flash, Harvard, and RGB/NIR datasets. It can be seen that our SCPNet produces accurate homography predictions on a variety of data, while the others fail due to the large modality/spectral variance and homography deformation.

### 5.4 Evaluation on PDS-COCO

We further conduct an evaluation on PDS-COCO, with the results listed in Table 4. Following [27], $\delta$ represents the content distortion of brightness, contrast, saturation, and hue, whose absolute value grows with the distortion magnitude. As the intensity and gradient variations of the PDS-COCO dataset are smaller than those of the cross-modal/spectral ones, some unsupervised methods such as biHomE and UDHN produce much more accurate homography estimation than on the previous datasets. However, they are still inferior to our SCPNet. SCPNet leads biHomE by 58.2% under the maximum content distortion and 59.5% under the minimum one, and also outperforms the supervised methods DHN and MHN.

Table 4: MACE on PDS-COCO under different content distortions $\delta$. UDHN, CA-UDHN, biHomE, and SCPNet are unsupervised; DHN, MHN, LocalTrans, IHN, and RHWF are supervised. 'NC' denotes non-convergence.

| Distortion | UDHN | CA-UDHN | biHomE | SCPNet (Ours) | DHN | MHN | LocalTrans | IHN | RHWF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\delta=\pm 8$ | 3.24 | NC | 2.20 | 0.89 | 2.09 | 1.07 | 0.68 | 0.19 | 0.07 |
| $\delta=\pm 16$ | 5.51 | NC | 2.37 | 0.88 | 2.24 | 1.10 | 0.70 | 0.19 | 0.07 |
| $\delta=\pm 32$ | NC | NC | 2.61 | 1.09 | 2.50 | 1.22 | 0.79 | 0.21 | 0.09 |

### 5.5 Computational Burden

The computational burden of the two-branch intra-modal self-supervised learning during training mainly involves the extra synthetic data generation, the network forward propagation, and the computation and backward propagation of the self-supervised loss. Table 5 lists the training time and memory usage on an NVIDIA GeForce RTX 4090. We note that the inference time and memory usage do not increase.

Table 5: Training time and memory usage with and without the two-branch intra-modal self-supervised learning ('Self').

| Setting | Time (hours) | Memory usage (MB) |
| --- | --- | --- |
| w/o Self | 3.18 | 4622 |
| w/ Self | 6.98 | 9144 |

## 6 Conclusions

We have proposed a novel unsupervised cross-modal homography estimation framework, named SCPNet. The concept of intra-modal self-supervised learning is introduced for the first time, wherein two-branch self-supervised information is fully exploited by applying simulated homographies within the two modalities, providing strong support for unsupervised cross-modal training. Building upon this, by combining correlation and consistent feature map projection, SCPNet achieves successful unsupervised homography estimation on multiple challenging datasets. Extensive experiments demonstrate the effectiveness of SCPNet in handling large offsets and modality gaps.

Limitations. The homography estimation network of SCPNet is designed to better facilitate the unsupervised training framework. Strategies such as multi-scale design, iteration, and replacing CNNs with transformers, which can further improve homography estimation accuracy at the network design level, will be investigated in our future work.

## Acknowledgments

This work was supported in part by the National Key Research and Development Program of China under Grant No. 2023YFB3209800, in part by the “Pioneer” and “Leading Goose” R&D Program of Zhejiang under Grant No. 2023C03136, in part by the Zhejiang Provincial Natural Science Foundation of China under Grant No. LD24F020003, and in part by the National Natural Science Foundation of China under Grant No. 62301484.

## References

- [1] Aguilera, C.A., Sappa, A.D., Toledo, R.: LGHD: A feature descriptor for matching across non-linear intensity variations. In: Proceedings of the IEEE International Conference on Image Processing. pp. 178–181. IEEE (2015)
- [2] Arar, M., Ginger, Y., Danon, D., Bermano, A.H., Cohen-Or, D.: Unsupervised multi-modal image registration via geometry preserving image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13410–13419 (2020)
- [3] Barath, D., Matas, J., Noskova, J.: MAGSAC: Marginalizing sample consensus. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10197–10205 (2019)
- [4] Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: Proceedings of the European Conference on Computer Vision. pp. 404–417. Springer (2006)
- [5] Brown, M., Süsstrunk, S.: Multi-spectral SIFT for scene category recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 177–184 (2011)
- [6] Cao, S.Y., Hu, J., Sheng, Z., Shen, H.L.: Iterative deep homography estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1879–1888 (2022)
- [7] Cao, S.Y., Shen, H.L., Chen, S.J., Li, C.: Boosting structure consistency for multispectral and multimodal image registration. IEEE Transactions on Image Processing 29, 5147–5162 (2020)
- [8] Cao, S.Y., Zhang, R., Luo, L., Yu, B., Sheng, Z., Li, J., Shen, H.L.: Recurrent homography estimation using homography-guided image warping and focus transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9833–9842 (2023)
- [9] Caruana, R.: Multitask learning. Machine Learning 28, 41–75 (1997)
- [10] Chakrabarti, A., Zickler, T.: Statistics of real-world hyperspectral images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 193–200 (2011)
- [11] Chen, S.J., Shen, H.L., Li, C., Xin, J.H.: Normalized total gradient: A new measure for multispectral image registration. IEEE Transactions on Image Processing 27(3), 1297–1310 (2017)
- [12] DeTone, D., Malisiewicz, T., Rabinovich, A.: Deep image homography estimation. arXiv preprint arXiv:1606.03798 (2016)
- [13] Dharejo, F.A., Zawish, M., Deeba, F., Zhou, Y., Dev, K., Khowaja, S.A., Qureshi, N.M.F.: Multimodal-boost: Multimodal medical image super-resolution using multi-attention network with wavelet transform. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2022)
- [14] Di, W., Jinyuan, L., Xin, F., Liu, R.: Unsupervised misaligned infrared and visible image fusion via cross-modality image generation and registration. In: International Joint Conference on Artificial Intelligence (2022)
- [15] Doersch, C., Zisserman, A.: Multi-task self-supervised visual learning. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2051–2060 (2017)
- [16] Dubrofsky, E.: Homography estimation. Master's thesis, University of British Columbia, Vancouver (2009)
- [17] Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–395 (1981)
- [18] Girdhar, R., Singh, M., Ravi, N., Van Der Maaten, L., Joulin, A., Misra, I.: Omnivore: A single model for many visual modalities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16102–16112 (2022)
- [19] Goforth, H., Lucey, S.: GPS-denied UAV localization using pre-existing satellite imagery. In: 2019 International Conference on Robotics and Automation. pp. 2974–2980. IEEE (2019)
- [20] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Advances in Neural Information Processing Systems 27 (2014)
- [21] He, S., Lau, R.W.: Saliency detection with flash and no-flash image pairs. In: Proceedings of the European Conference on Computer Vision. pp. 110–124. Springer (2014)
- [22] Holland, P.W., Welsch, R.E.: Robust regression using iteratively reweighted least-squares. Communications in Statistics - Theory and Methods 6(9), 813–827 (1977)
- [23] Hong, M., Lu, Y., Ye, N., Lin, C., Zhao, Q., Liu, S.: Unsupervised homography estimation with coplanarity-aware GAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17663–17672 (2022)
- [24] Hu, R., Singh, A.: UniT: Multimodal multitask learning with a unified transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1439–1449 (2021)
- [25] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Proceedings of the European Conference on Computer Vision. pp. 694–711. Springer (2016)
- [26] Kim, S., Min, D., Ham, B., Ryu, S., Do, M.N., Sohn, K.: DASC: Dense adaptive self-correlation descriptor for multi-modal and multi-spectral correspondence. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2103–2112 (2015)
- [27] Koguciuk, D., Arani, E., Zonooz, B.: Perceptual loss for robust unsupervised homography estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4274–4283 (2021)
- [28] Le, H., Liu, F., Zhang, S., Agarwala, A.: Deep homography estimation for dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7652–7661 (2020)
- [29] Li, J., Hu, Q., Ai, M.: RIFT: Multi-modal image matching based on radiation-variation insensitive feature transform. IEEE Transactions on Image Processing 29, 3296–3310 (2019)
- [30] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Proceedings of the European Conference on Computer Vision. pp. 740–755. Springer (2014)
- [31] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
- [32] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
- [33] Marivani, I., Tsiligianni, E., Cornelis, B., Deligiannis, N.: Designing CNNs for multimodal image restoration and fusion via unfolding the method of multipliers. IEEE Transactions on Circuits and Systems for Video Technology 32(9), 5830–5845 (2022)
- [34] Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics 31(5), 1147–1163 (2015)
- [35] Nguyen, T., Chen, S.W., Shivakumar, S.S., Taylor, C.J., Kumar, V.: Unsupervised deep homography: A fast and robust homography estimation model. IEEE Robotics and Automation Letters 3(3), 2346–2353 (2018)
- [36] Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: An efficient alternative to SIFT or SURF. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2564–2571 (2011)
- [37] Shao, R., Wu, G., Zhou, Y., Fu, Y., Fang, L., Liu, Y.: LocalTrans: A multiscale local transformer network for cross-resolution homography estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14890–14899 (2021)
- [38] Wang, C., Wang, X., Bai, X., Liu, Y., Zhou, J.: Self-supervised deep homography estimation with invertibility constraints. Pattern Recognition Letters 128, 355–360 (2019)
- [39] Xu, H., Ma, J., Yuan, J., Le, Z., Liu, W.: RFNet: Unsupervised network for mutually reinforcing multi-modal image registration and fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19679–19688 (2022)
- [40] Yasuma, F., Mitsunaga, T., Iso, D., Nayar, S.K.: Generalized assorted pixel camera: Postcapture control of resolution, dynamic range, and spectrum. IEEE Transactions on Image Processing 19(9), 2241–2253 (2010)
- [41] Ye, N., Wang, C., Fan, H., Liu, S.: Motion basis learning for unsupervised deep homography estimation with subspace projection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13117–13125 (2021)
- [42] Ye, Y., Tang, T., Zhu, B., Yang, C., Li, B., Hao, S.: A multiscale framework with unsupervised learning for remote sensing image registration. IEEE Transactions on Geoscience and Remote Sensing 60, 1–15 (2022)
- [43] Ying, J., Shen, H.L., Cao, S.Y.: Unaligned hyperspectral image fusion via registration and interpolation modeling. IEEE Transactions on Geoscience and Remote Sensing (2021)
- [44] Zhang, J., Wang, C., Liu, S., Jia, L., Ye, N., Wang, J., Zhou, J., Sun, J.: Content-aware unsupervised deep homography estimation. In: Proceedings of the European Conference on Computer Vision. pp. 653–669. Springer (2020)
- [45] Zhao, Y., Huang, X., Zhang, Z.: Deep Lucas-Kanade homography for multimodal image alignment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15950–15959 (2021)
- [46] Zhou, Y., Rangarajan, A., Gader, P.D.: An integrated approach to registration and fusion of hyperspectral and multispectral images. IEEE Transactions on Geoscience and Remote Sensing 58(5), 3020–3033 (2019)