3 Multisensor Image Registration using Correlation-based Deep Adversarial Networks

3.1 Introduction

As discussed in the context of Chapter 2, the availability of satellite data is constantly growing. This wealth of data comes from a variety of passive and active instruments with different spatial resolutions, frequencies, polarizations, etc. To exploit all the images available over a given scene, accurate registration is necessary. While semi-interactive procedures based on manually annotated control points are well established, automatic image registration is a challenging problem, especially when the reference and input images to be registered have been acquired by distinct sensors. Indeed, the main source of difficulty arises from the need to spatially relate data that are characterized by intrinsically different physical properties, statistics, and textures [16].

As already discussed in Chapter 2, automatic registration methods can be broadly divided into two categories: feature-based and area-based. The former rely on the extraction of relevant spatial features from the reference and input images and are generally faster but less accurate than the latter. Their residual error usually depends on the accuracy of the feature extraction. Conversely, area-based methods operate directly on the whole image area and do not focus on specific point or linear features. They generally provide lower registration error at the price of higher computational burden. An analysis of such complementarity between the two categories has already been discussed in Section 2.3 with respect to the two-step registration strategy.

This chapter is based on the following publication:

L. Maggiolo, D. Solarna, G. Moser and S. B. Serpico, "Automatic area-based registration of optical and SAR images through generative adversarial networks and a correlation-type metric", Proceedings of the 2020 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2020), Waikoloa, Hawaii, USA, 2020.

Concerning feature-based registration methods, well-known approaches include feature-point registration algorithms (e.g., Harris point detectors) or scale-invariant feature transforms (SIFT) [16]. Area-based methods often rely on cross-correlation-type and information-theoretic metrics (e.g., mutual information, MI). The latter metrics are computationally heavier but more robust to differences in the image properties, owing to their ability to capture dependencies across the intensity distributions. Correlation-type similarities are faster, thanks to fast Fourier transform formulations, but less powerful and generally ineffective in multisensor scenarios [16, 114].

In Chapter 2, the focus of the experimental analysis was on planetary image registration using optical imagery. Conversely, this chapter focuses on a different problem, namely the multisensor registration of EO imagery. In 1995, Li et al. proposed a feature-based multisensor image registration method using region boundaries and other strong edges as matching primitives [19]. The rationale is that, although optical and SAR images have different radiometric properties, the contours representing the region boundaries in the scene are preserved in most cases. The adopted transformation model is the rigid transformation, the same model used in the area-based solution proposed in [115], which makes use of a histogram-based computation of the mutual information as a similarity measure between the optical and SAR data.

In the last decade, another line of research in the field of multisensor image registration has proposed different solutions based on the wavelet decomposition [40, 41] of the input images. In 2010, Wong and Clausi proposed, in a multiscale framework, a feature-based registration method based on the local phase coherence of the images. The proposed feature representation was proven less sensitive than other metrics to the control point outliers that, in the case of multisensor registration, are extremely common due to the different nature of the data. Similarly, the work in [116] integrates the multiscale decomposition with the extraction and matching of line segments. Conversely, the solution proposed in [117] is an area-based solution based on an information-theoretic functional that is optimized via simulated annealing. In this case, the wavelet decomposition is applied with the aim of reducing the computational burden required by the global optimization strategy.

More recent solutions are based on the customization of local descriptors. Two modified versions of the SIFT method for the specific case of optical and SAR matching have been proposed in [118] and [119]. Both methods modified the SIFT processing chain in order to take into account the radiometric differences of optical and SAR data. The former refined the standard SIFT output by exploring the spatial relationship of the extracted features and taking into account the possible presence of speckle, while the latter proposed a modification in the keypoint detection phase by working in two Harris scale spaces. In 2017, the work in [120] proposed: (i) another multisensor descriptor, named histogram of orientated phase congruency (HOPC), which is based on the structural properties of images; and (ii) another similarity metric, named HOPCncc, which uses the normalized correlation coefficient (NCC) of the HOPC descriptors for multimodal registration. Another recent example of a multisensor registration strategy based on the extraction of local descriptors is the one proposed in [121], where a phase congruency structural descriptor (PCSD) is combined with uniform nonlinear diffusion to reduce the influence of speckle in the extraction phase.

From a methodological perspective, the registration of multisensor remote sensing data implies modeling the dependence across heterogeneous data sources, a scenario that also relates to the areas of domain adaptation and image-to-image translation [122]. In this respect, besides the more standard multisensor image registration methods described so far, deep learning methods based on generative adversarial networks (GANs) have recently attracted considerable attention. A GAN is made of two neural networks trained in competition and aims to generate, from input noise, output data whose distribution matches that of a target source [123]. In a conditional GAN (cGAN), this idea is further extended to allow for input data conditioned on a non-noise source [124]. Further extensions also address cyclic consistency in translating two sources onto each other [125].


The potential of GAN architectures in the framework of multisensor registration has been recently demonstrated in [126] in a feature-based scenario. A cGAN is trained to translate from input optical to output SAR data, and the outcome is used to generate matching control points. The method involves the need to pre-select matching areas, using an external land cover layer (e.g., CORINE) and a manual refinement, to include salient planar features and exclude elevated objects [126]. Other recent works addressing multisensor registration through GAN approaches are [127–129].

This chapter proposes a novel multisensor registration method based on the integration of GAN and area-based registration concepts. Focusing on the case of multispectral and SAR data, the cGAN model in [124] is trained to estimate what the optical image would look like if it were acquired through the SAR sensor. Then, a correlation-type ℓ2 similarity metric and the "constrained optimization by linear approximation" (COBYLA) algorithm for derivative-free optimization [130] are combined to automatically achieve registration. The rationale is to exploit the power of GAN approaches to translate the reference and input images to a common domain in which their similarity allows a computationally efficient correlation-type functional to be applicable and effective. Moreover, the adopted area-based strategy avoids the need for feature extraction stages or for semi-manually selecting image areas with appropriate spatial features.

Section 3.2 will briefly introduce the use of deep learning in the context of image registration. In particular, Section 3.2.1 will cover the recent growth of deep learning research, starting from its origin and moving to the main reasons why it is nowadays one of the most important and widespread machine learning frameworks. Moreover, Section 3.2.2 will overview the state-of-the-art methods concerning image-to-image translation via deep learning solutions, and Section 3.2.3 will briefly recap the theory behind conditional generative adversarial networks, with focus on the loss function and the adversarial formulation. Then, Section 3.3 will focus on the technical aspects of the proposed multisensor image registration methodology, first providing an overview of the adopted method's flowchart (Section 3.3.1), then detailing the proposed deep learning architecture (Section 3.3.2) and the image registration strategy (Section 3.3.3). Finally, the experimental analysis, with details on the dataset used for the experiments, will be provided in Section 3.4, while the conclusions, with possible future developments, will be drawn in Section 3.5.


3.2 An Introduction on the Use of Deep Learning for Image-to-Image Translation

In the last decade, deep learning (DL), and deep neural networks (DNNs) in particular, have established themselves as the leading solution in the signal processing world, achieving state-of-the-art performance in image, audio, and natural language understanding. The recent advances in deep learning have opened the door to the most recent applications and research directions, including autonomous driving, virtual assistants (e.g., voice search and voice-activated assistants), real-time machine translation, social-network filtering, recommendation systems, etc. Also in the context of remote sensing, a large body of research has been devoted to the application of deep learning to typical remote sensing applications, ranging from semantic classification and automatic change detection to image-to-image translation, image registration, etc. [123].

3.2.1 A Brief History of Deep Learning

The origin of the current deep learning evolution (Figure 3.1) dates back to the 1950s and 1960s. In 1957, the Perceptron system by Frank Rosenblatt [132], able to distinguish different types of objects from a simple camera, gathered particular interest. The interest in the perceptron model continued through the '60s until, in 1969, a publication by Marvin Minsky and Seymour Papert [133] contained a proof showing that linear perceptrons could not represent the behavior of a nonlinear function (i.e., the XOR operator). Despite the limitations of that proof, the many criticisms that arose, and the fact that nonlinear perceptrons already existed at the time, neural network research suffered a decline until the '80s.

Due to the increase in computing power and the development of the back-propagation technique, in the '80s neural networks regained attention. Computers had the power to train larger and deeper networks, leading to the first convolutional neural network by Fukushima [134]. Such an architecture was based on the model of the visual recognition system of mammalian brains and was able to efficiently recognize complex images such as handwritten digits and faces.

Figure 3.1: Timeline of the deep learning evolution in the last decades (image by Favio Vázquez [131]).

During the '90s, however, the interest in neural networks declined again in favour of models like support vector machines (SVMs) and decision trees [135]. SVMs were proven to be excellent classifiers for many applications, especially when coupled with human-engineered features. Feature engineering, which involves the extraction of informative elements from the input data (e.g., the borders or the texture information for image processing applications) to be combined and fed into the classifiers, became popular. Indeed, it has nowadays been proven [123] that deep neural networks are able to automatically recognize, extract, and combine features that are very similar to the ones that are usually hand-engineered, provided that suitably large training sets are available.

With the advent of general-purpose programming on graphics processing units (GPUs) in the late 2000s, neural network architectures were able to make great strides over the competition. Another important factor in the rise of deep learning solutions in the last two decades is the availability of very large datasets. Combining the capability of training larger networks in shorter time with the exponentially growing availability of data, from the late 2000s neural networks gained the most popularity in many fields. This dominance has continued in the succeeding years, with improved techniques (e.g., more sophisticated activation functions like the Leaky-ReLU function [136], dropout regularization [137], adversarial networks [138], etc.) and applications of neural networks to areas outside of image recognition, including translation, speech recognition, and image synthesis (e.g., bidirectional recurrent neural networks (bidirectional RNNs) [139], long short-term memory (LSTM) networks [140], generative adversarial networks [138], image-to-image translation [124], etc.).

3.2.2 Overview of Image-to-Image Translation Methods based on Generative Adversarial Networks

Many image processing, computer graphics, and computer vision problems can be viewed as "translating" an input image into a corresponding output image. Just as a concept may be expressed in either English or Italian, a scene may be rendered as an RGB image, a gradient field, an edge map, a semantic label map, etc. In analogy to automatic language translation, it is possible to define automatic image-to-image translation as the task of translating one possible representation of a scene into another, given sufficient training data, despite the fact that the setting is always the same: predict pixels from pixels [124].

Image-to-image translation can be formulated as a per-pixel classification or regression. An example is provided in [141], where a skip-connection architecture combines semantic information from a deep and coarse layer (appearance information) with a shallow fine layer in order to produce accurate and detailed segmentation results. Other examples are the cases of [142] and [143] that face the problems of edge detection and image colourization, respectively.

The previous examples are related to unstructured predictions, in the sense that each output pixel is considered conditionally independent from all others, given the input image. Conditional GANs, conversely, learn a structured loss that penalizes the joint configuration of the output. In the literature, different structured approaches for image-to-image translation exist. An example is given by [144], where deep convolutional networks are combined with fully connected conditional random fields. In addition, the work in [145] presents a novel type of metric, called deep perceptual similarity metrics, that allows mitigating the over-smoothness in the results that is typical of distance-based loss functions applied to the image domain. Another work presenting a structured perceptual loss function is [146], where the loss function is not computed on a per-pixel basis but is based on high-level features extracted from pretrained networks.

Conditional GANs have been applied not only to images, but also to text [147], discrete labels [148], and speech [149]. In [147], cGANs are applied to generate images from a textual description, while [148] uses cGANs to generate multimodal distributions of tag-vectors conditioned on image features, and [149] uses conditional adversarial training to predict emotions from speech signals.

Concerning image inputs, cGANs have been deployed in the context of many different applications in recent years. Examples are provided by [150] for future frame prediction from a time series of images, [151] for product photo generation, and in particular for the generation of pieces of clothing from input images of dressed persons, [152] for image generation from sparse annotation, whose architecture is able to generate realistic outdoor scene images under different conditions (e.g., day-night and sunny-foggy, with clear object boundaries), and [153], which adopts a cGAN framework for predicting images from normal maps.


Moreover, there is another category of studies that have deployed GANs for image-to-image mapping. Differently from the above, such works deploy the GAN unconditionally and rely on other terms (such as L2 regression) to force the output to be conditioned on the input. Examples are represented by [154] for image inpainting, [155] for future state prediction, and in particular for creating images of objects at future times given time-lapse videos as input data, [156] for style transfer and texture synthesis, and [157] for super resolution, which is able to infer photo-realistic natural images with an upscaling factor of up to 4.

3.2.3 Theoretical Foundation of Conditional Generative Adversarial Networks

For the sake of clarity, let us focus first on generative adversarial networks [138] and then move to the more specific case of conditional adversarial networks. A GAN consists of two adversarial models: (i) a generative model G that captures the data distribution; and (ii) a discriminative model D that estimates the probability that a sample came from the training data rather than from G. Both G and D can be any nonlinear mapping function, such as a multi-layer perceptron or, in the case of image input data, a convolutional neural network.

For the sake of simplicity, let us consider the case in which both G and D are multi-layer perceptrons [138]. Let p_z(z) be a prior noise distribution; then the goal of the generator is to learn a generator distribution p_g(x), over the input data x, able to map the random noise p_z(z) to the data space. Such a mapping is defined as G(z). Conversely, the discriminator is aimed at learning the mapping D(x) that outputs a single scalar representing the probability that the data x came from the training set rather than being generated from p_g(·). Both G(·) and D(·) are parametrized by vectors of model parameters, namely θ_g and θ_d.

In the training phase, D is trained so as to maximize the probability of assigning the correct label to both training examples and samples from G. Simultaneously, G is trained to minimize the quantity log(1 − D(G(z))). In other words, D and G play a two-player minimax game with value function V(D, G):

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{3.1}
$$

Generative adversarial networks can be extended to a conditional model by conditioning both the generator and the discriminator on some extra information y. Such additional information can be any type of auxiliary information, such as class labels or data from other modalities. Indeed, y is fed, as an additional input layer, into the generator and the discriminator (see Figure 3.2).

In the generator, the prior input noise p_z(z) and y are combined into a joint hidden representation while, in the discriminator, x and y are presented as inputs to a discriminative function. Therefore, the two-player minimax game of Eq. 3.1 becomes:

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))] \tag{3.2}
$$
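As an illustration only, the two training objectives implied by Eq. 3.2 can be written in a few lines of PyTorch-style code. The sketch below is a minimal, hypothetical implementation: the networks G and D, the conditioning input y, and the noise z are placeholders and do not correspond to the architecture adopted later in this chapter.

```python
# Minimal sketch of the conditional GAN objective in Eq. 3.2 (illustrative only).
# G(z, y) generates a sample conditioned on y; D(x, y) outputs the probability
# that (x, y) is a real pair. Both networks are assumed to be defined elsewhere.
import torch

def discriminator_loss(D, G, x, y, z, eps=1e-8):
    """D maximizes log D(x | y) + log(1 - D(G(z | y))); we minimize the negative."""
    real_score = D(x, y)
    fake = G(z, y).detach()            # stop gradients into G during the D update
    fake_score = D(fake, y)
    return -(torch.log(real_score + eps).mean()
             + torch.log(1.0 - fake_score + eps).mean())

def generator_loss(D, G, y, z, eps=1e-8):
    """Non-saturating variant: G maximizes log D(G(z | y)) instead of
    minimizing log(1 - D(G(z | y))), which gives stronger early gradients."""
    fake_score = D(G(z, y), y)
    return -torch.log(fake_score + eps).mean()
```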

3.3 Proposed Methodology

3.3.1 Flowchart of the Proposed Method

Briefly recalling the notions introduced in Chapter 2, let R(x, y) and I(x, y) be two images to be registered, where x and y indicate the spatial coordinates and R and I take on nonzero values in some bounded subset of ℝ² (i.e., R and I are compactly supported functions). Let the two images represent the same surface and let them be aligned differently.

Figure 3.2: Conditional adversarial network [148].

In Chapter 2, the registration problem was introduced by considering a reference image R and an input image I. However, in this chapter, in order to better differentiate the types of data, we will denote by R a multispectral image and by S a SAR image (acquired over the same ground area as R and to be registered). Consequently, let r and s denote multispectral and SAR data samples, respectively. Consistently with the usual formulations of convolutional neural networks (CNNs), r and s collect data from image patches [123]. The proposed method is composed of two steps.

First, a cGAN is trained on matching pairs of optical-SAR patches and applied to R to predict an estimated SAR image Ŝ that emulates the statistical and spatial structure of SAR imagery on the frame of reference of R (see Section 3.3.2). As the cGAN is supervised, an assumption of the proposed method is that a set {(r_i, s_i)}, i = 1, …, l, of properly registered optical-SAR patch pairs is available. This set may be derived from previously registered data of the same sensors as R and S or from the manual registration of a subset of the (R, S) data set.

Then, consistently with the formulation used in Chapter 2, let R be the reference multispectral image and S be the input SAR image to be transformed. A mapping T_p : ℝ² → ℝ², from the coordinates in Ŝ and R to those in S, thus warping S so that it spatially matches R and Ŝ, is searched for. The optimal transformation T_p* is computed by maximizing an ℓ2 similarity functional C(T_p) over a predefined transformation space using COBYLA (see Section 3.3.3).


The flowchart of the proposed method is shown in Figure 3.3. The details of the conditional GAN architecture will be analysed in depth in the following Section, while additional information on the registration step will be provided in Section 3.3.3.

3.3.2 Conditional GAN Stage

The pix2pix cGAN architecture in [124] is used to generate Ŝ. It consists of two CNNs that, in the proposed method, formalize the mapping G(·) (generator) from an optical patch r to an estimated SAR patch ŝ and the mapping D(·) (discriminator) from a candidate SAR patch to (0, 1). Indeed, the discriminator is meant to distinguish whether the input SAR patch is real or generated by G, and the generator is aimed at producing output SAR patches that are accurate enough to fool the discriminator. Leveraging the theory introduced in Section 3.2.3, this adversarial behavior is formalized through the loss function in Eq. 3.3 [124]:

$$
\mathcal{L}(D, G) = \mathbb{E}_{r,s}\{\log D(r, s)\} + \mathbb{E}_{r,z}\{\log[1 - D(r, G(r, z))]\} + \lambda\, \mathbb{E}_{r,s,z}\{\lVert s - G(r, z) \rVert_1\}, \tag{3.3}
$$

where λ is a positive weight coefficient and z is a dropout noise. The additional term based on the L1-norm in the loss function encourages the generated image Ŝ to be structurally similar to the target image S. As for the noise term, without z the network could still learn the mapping from r to s, but it would produce deterministic outputs and therefore fail to match any distribution other than a delta function. An option would be to add a noise contribution to the input data. However, based on the findings in [124, 150], such a strategy is not effective, as the generator would simply learn to ignore the noise. An effective solution, proposed in [124] and also used in the proposed framework, consists in providing the noise only in the form of dropout, applied to several layers of the generator at both training and test time.
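As a minimal sketch of how Eq. 3.3 translates into training code, the two losses could be written as follows in PyTorch-style pseudocode; the networks G and D, the batch tensors r and s, and the default value of lam are illustrative assumptions, and dropout inside G plays the role of the noise z.

```python
# Illustrative pix2pix-type losses for Eq. 3.3: adversarial term plus a
# lambda-weighted L1 term between generated and target SAR patches.
import torch

def generator_loss(D, G, r, s, lam=100.0, eps=1e-8):
    """r: optical patches, s: registered SAR patches (lam is a placeholder value)."""
    s_hat = G(r)                                   # estimated SAR patch (dropout acts as z)
    adv = -torch.log(D(r, s_hat) + eps).mean()     # fool the conditional discriminator
    l1 = torch.abs(s - s_hat).mean()               # structural similarity term
    return adv + lam * l1

def discriminator_loss(D, G, r, s, eps=1e-8):
    s_hat = G(r).detach()
    return -(torch.log(D(r, s) + eps).mean()
             + torch.log(1.0 - D(r, s_hat) + eps).mean())
```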

The following two sections provide details on the architectures of both the generator and the discriminator. Graphical representations are also provided in Figure 3.3.

Figure 3.3: Proposed network architecture.

Figure 3.4: The adopted U-Net architecture as compared with a simple encoder-decoder structure. The skip connections concatenate the outputs of the encoder layers with the outputs of the mirrored decoder's layers [124].

Generator

The adopted CNN architecture is a U-Net model [158], one of the most widespread architectures, which consists of an overall autoencoder structure. It is composed of a set of downsampling layers (encoder), a bottleneck layer at the lowest resolution, and a set of upsampling layers (decoder) that are symmetric to the first part.

The U-Net model also includes skip connections between each layer i and layer n − i, where n is the total number of layers. Each skip connection simply concatenates all channels at layer i with those at layer n − i (see Figure 3.4). This is aimed at reducing the compression that the input data would suffer if there were no data transfer between the encoder and the decoder sides and the information were forced to pass through all the layers, including the bottleneck. According to such a skip connection strategy, the feature maps of the downsampling layers are propagated to the corresponding upsampling layers, so as to preserve high-resolution details and avoid spatially over-compressing the input data.

More in detail, the structure of the generator is composed as follows (a code sketch is given after this list):

• The encoder is made of 8 blocks, each composed of convolution, batch normalization [159] (except for the first block), and a leaky ReLU activation function, followed by a pooling operator.

• The decoder is made of 8 blocks, each composed of transposed convolution (performing the inverse operation of the pooling), batch normalization, dropout (applied to the first 3 blocks only), and a ReLU activation function.

• There are skip connections linking each block in the encoder with the corresponding one in the decoder.

• The size of the filters is set equal to 5 × 5 pixels in order to deal with possible residual subpixel alignment errors.
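The sketch below is an illustrative PyTorch approximation of such a generator (8 encoder blocks, 8 decoder blocks, 5 × 5 filters, skip connections, dropout in the first 3 decoder blocks); it is not the exact pix2pix implementation, channel widths are assumptions, and stride-2 convolutions are used here in place of a separate pooling step.

```python
# Illustrative U-Net generator: 8 down blocks, 8 up blocks, 5x5 filters,
# concatenating skip connections, dropout in the first 3 decoder blocks.
import torch
import torch.nn as nn

def down_block(in_ch, out_ch, use_bn=True):
    # downsampling via stride-2 convolution (the text describes conv + pooling)
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2)]
    if use_bn:
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2))
    return nn.Sequential(*layers)

def up_block(in_ch, out_ch, use_dropout=False):
    layers = [nn.ConvTranspose2d(in_ch, out_ch, kernel_size=5, stride=2,
                                 padding=2, output_padding=1),
              nn.BatchNorm2d(out_ch)]
    if use_dropout:
        layers.append(nn.Dropout(0.5))   # dropout also acts as the noise source z
    layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class UNetGenerator(nn.Module):
    def __init__(self, in_ch=3, out_ch=1, base=64):
        super().__init__()
        chans = [base, base*2, base*4, base*8, base*8, base*8, base*8, base*8]
        self.downs = nn.ModuleList()
        prev = in_ch
        for i, c in enumerate(chans):
            self.downs.append(down_block(prev, c, use_bn=(i > 0)))
            prev = c
        rev = chans[::-1]
        self.ups = nn.ModuleList()
        for i, c in enumerate(rev[1:]):                 # 7 inner decoder blocks
            in_c = rev[0] if i == 0 else rev[i] * 2     # *2: concatenated skip channels
            self.ups.append(up_block(in_c, c, use_dropout=(i < 3)))
        self.last = nn.ConvTranspose2d(chans[0] * 2, out_ch, kernel_size=5,
                                       stride=2, padding=2, output_padding=1)

    def forward(self, x):
        skips = []
        for d in self.downs:
            x = d(x)
            skips.append(x)
        skips = skips[:-1][::-1]                        # mirror order, skip the bottleneck
        for i, u in enumerate(self.ups):
            x = torch.cat([u(x), skips[i]], dim=1)      # skip connection: channel concat
        return torch.tanh(self.last(x))
```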

Discriminator

The adopted CNN architecture is a PatchGAN model [124], which stacks blocks composed of convolution, batch normalization (except for the first block), and a leaky ReLU activation function, followed by a pooling operator. Also in this case, the size of the filters has been set equal to 5 × 5.

The PatchGAN model is a type of discriminator used in generative adversarial networks which only penalizes structure at the scale of local image patches. The PatchGAN discriminator tries to classify whether each patch in an image is real or fake. This discriminator is run convolutionally across the image, averaging all responses to provide the ultimate output of D. Such a discriminator effectively models the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter. It can be understood as a type of texture/style loss.

The adopted loss function includes the L1-norm penalty term, as shown in Eq. 3.3. Such a penalty enables the network to capture the low-frequency details, at the expense of high-frequency crispness. Indeed, L2 and L1 norms are well known to produce blurry results in image generation problems [160]. Adopting a PatchGAN model, and thus restricting the attention to the structure in local image patches, allows the GAN discriminator to model the high-frequency structure that is lacking in the pixel-wise loss term.
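A compact, hypothetical PyTorch sketch of such a PatchGAN-style discriminator with 5 × 5 filters is given below; the number of blocks and the channel widths are illustrative assumptions rather than the exact configuration used in the experiments.

```python
# Illustrative PatchGAN discriminator: each cell of the output map scores one
# local patch as real/fake; the scores are averaged into the final output of D.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, opt_ch=3, sar_ch=1, base=64):
        super().__init__()
        def block(in_c, out_c, use_bn=True):
            layers = [nn.Conv2d(in_c, out_c, kernel_size=5, stride=2, padding=2)]
            if use_bn:
                layers.append(nn.BatchNorm2d(out_c))
            layers.append(nn.LeakyReLU(0.2))
            return nn.Sequential(*layers)
        self.body = nn.Sequential(
            block(opt_ch + sar_ch, base, use_bn=False),        # optical and SAR stacked
            block(base, base * 2),
            block(base * 2, base * 4),
            nn.Conv2d(base * 4, 1, kernel_size=5, padding=2),  # one score per local patch
        )

    def forward(self, r, s):
        x = torch.cat([r, s], dim=1)                # condition on the optical patch r
        patch_scores = torch.sigmoid(self.body(x))
        return patch_scores.mean(dim=(1, 2, 3))     # average the patch responses
```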

3.3.3 Transformation Stage

As for the registration of planetary images in Section 2.3.4, we focus on the family of rotation-scale-translation (RST) transformations (i.e., similarity transformations) [16]. Considering R as the reference multispectral image and S as the input SAR image, the RST mapping T_p : ℝ² → ℝ² from the spatial coordinate system of the reference to that of the input image ((x, y) ∈ ℝ²) is:

$$
T_p(x, y) = \begin{bmatrix} k\cos\theta & k\sin\theta & t_x \\ -k\sin\theta & k\cos\theta & t_y \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \tag{3.4}
$$

where t_x and t_y determine translations in the x and y directions, respectively (t_x, t_y ∈ ℝ), θ is the rotation angle (0 ≤ θ < 2π), k is the scaling factor (k > 0), and p = (t_x, t_y, θ, k) is the vector collecting all transformation parameters; p takes values in the set P = ℝ² × [0, 2π) × (0, +∞). To register the two images, we look for the value p* of p ∈ P such that the transformed input image S(T_p*(x, y)) (up to resampling on the corresponding pixel lattice) best matches the reference image R(x, y) over its support.
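To make Eq. 3.4 concrete, the short NumPy sketch below builds the 2 × 3 RST matrix and maps reference-grid coordinates into the input-image coordinate system; it is purely illustrative and all names are hypothetical.

```python
# Illustrative implementation of the RST mapping T_p of Eq. 3.4.
import numpy as np

def rst_matrix(tx, ty, theta, k):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[ k * c, k * s, tx],
                     [-k * s, k * c, ty]])

def apply_rst(p, xy):
    """p = (tx, ty, theta, k); xy: (N, 2) array of (x, y) coordinates.
    Returns the (N, 2) warped coordinates T_p(x, y)."""
    tx, ty, theta, k = p
    homog = np.c_[xy, np.ones(len(xy))]       # append the homogeneous coordinate
    return homog @ rst_matrix(tx, ty, theta, k).T
```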

As compared to larger families of transformations (e.g., affine or polynomial), RST models often represent an effective trade-off between the generality of the transform and the dimensionality of the parameter space in which p* is sought (four parameters for RST). In the proposed registration framework, a simple ℓ2 area-based similarity metric is used to measure the matching between the reference and the registered input images.

Considering the general multisensor registration case of a reference multispectral image R and an input SAR image S, the ℓ2 area-based similarity metric would be defined as follows:

$$
C(T_p) = \langle R, S(T_p(\cdot)) \rangle_{\ell_2} = \sum_{x,y} R(x, y)\, S(T_p(x, y)). \tag{3.5}
$$

Conversely, taking into consideration the conditional GAN setup, in which the reference multispectral optical image R is translated into the generated SAR image Ŝ, the similarity metric becomes:

$$
C(T_p) = \langle \hat{S}, S(T_p(\cdot)) \rangle_{\ell_2} = \sum_{x,y} \hat{S}(x, y)\, S(T_p(x, y)). \tag{3.6}
$$

The resulting formulation is proportional to the evaluation, at the origin, of the sample cross-correlation function between the generated SAR image Ŝ and the input SAR image warped through T_p. Such a similarity metric can be computed very efficiently by exploiting the matrix computation of the scalar product. Whereas the cross-correlation metric is typically ineffective for multisensor registration (i.e., as formulated in Eq. 3.5), its role in the proposed method and the resulting time efficiency are made possible by exploiting the domain adaptation capabilities of the cGAN and reformulating it according to Eq. 3.6.
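A possible NumPy/SciPy sketch of the metric in Eq. 3.6 is shown below: the input SAR image is resampled on the reference lattice through T_p and the inner product with the generated SAR image is computed. The use of scipy.ndimage.affine_transform and the assumption that the image axes coincide with the (x, y) coordinates are simplifications of this sketch, not details of the original implementation.

```python
# Illustrative evaluation of C(T_p) = sum_{x,y} S_hat(x, y) * S(T_p(x, y)).
import numpy as np
from scipy import ndimage

def similarity(p, S_hat, S):
    tx, ty, theta, k = p
    c, s = np.cos(theta), np.sin(theta)
    A = np.array([[ k * c, k * s],
                  [-k * s, k * c]])
    # affine_transform maps each output coordinate o to the input coordinate A @ o + offset,
    # i.e. it resamples S on the reference lattice through T_p (bilinear interpolation).
    warped = ndimage.affine_transform(S, A, offset=(tx, ty), order=1,
                                      output_shape=S_hat.shape)
    return float(np.sum(S_hat * warped))      # scalar product over the common support
```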

C(·) is generally a non-differentiable function of the transformation parameters p, thus requiring a maximization method that does not need gradients, Hessians, or higher-order terms. Among derivative-free optimization algorithms, the Constrained Optimization by Linear Approximation (COBYLA) method is adopted as a usually accurate and efficient approach. It is a constrained maximization technique that iteratively approximates the optimization problem by a suitable sequence of linear programming sub-problems [130]. Leveraging the idea of the Nelder–Mead optimization method [161], in each iteration of COBYLA the objective and the constraint functions are linearly approximated by interpolation at the vertices of a trust region, and optimization is performed using these approximations. The size of the trust region is controlled by the algorithm and is decreased as convergence is approached. The initial and final values, being problem-dependent, are set by the user. Due to the need for computing the linear approximations, COBYLA is generally applied to optimization problems with a small number of variables (e.g., fewer than 10), as is the case for image registration with an RST transformation model. In general, the convergence of COBYLA is slower than that of gradient-based algorithms, i.e., more function evaluations are required to find the optimum. However, among the salient features of COBYLA are its stability and the low number of parameters to be tuned for performing the optimization (i.e., the initial and final values of the trust region) [161].

In the application to C(·), box constraints can easily be predefined without loss of generality: θ takes values in [0, 2π], t_x and t_y can be bounded according to the sizes of R (or, similarly, Ŝ) and S, and bounds on k can be defined as a function of the spatial resolutions. Moreover, the initial radius ρ of the trust region, which is used to guide the initial exploration of the search space [130], is automatically optimized through a grid search: a finite dictionary D of increasing radius values is defined, COBYLA is run separately for each ρ ∈ D, leading to a candidate solution T_p*^ρ, and the candidate associated with the highest matching value C(T_p*^ρ) is selected (a sketch of this procedure is given below).
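The sketch below illustrates, under the stated assumptions, how this step could be implemented with SciPy's COBYLA: the correlation metric (the illustrative similarity function sketched above) is maximized for each candidate initial radius, and the best run is kept. The bound values and the dictionary of radii are placeholders rather than the exact settings used in the experiments.

```python
# Illustrative registration step: COBYLA maximization of C(T_p) with a grid
# search over the initial trust-region radius rho.
import numpy as np
from scipy.optimize import minimize

def register(S_hat, S, radii=(20, 30, 40, 50, 60)):
    p0 = np.array([0.0, 0.0, 0.0, 1.0])        # (tx, ty, theta, k): identity transform
    # box constraints expressed as COBYLA inequality constraints g(p) >= 0
    # (translation bounds, derived from the image sizes, are omitted for brevity)
    cons = [{'type': 'ineq', 'fun': lambda p: p[3] - 0.5},       # k >= 0.5 (placeholder)
            {'type': 'ineq', 'fun': lambda p: 2.0 - p[3]},       # k <= 2.0 (placeholder)
            {'type': 'ineq', 'fun': lambda p: p[2]},             # theta >= 0
            {'type': 'ineq', 'fun': lambda p: 2 * np.pi - p[2]}] # theta <= 2*pi
    best = None
    for rho in radii:                           # grid search over the initial radius
        res = minimize(lambda p: -similarity(p, S_hat, S), p0,   # maximize C(T_p)
                       method='COBYLA', constraints=cons,
                       options={'rhobeg': rho, 'tol': 1e-3})
        if best is None or res.fun < best.fun:  # smaller negative = larger C(T_p)
            best = res
    return best.x                               # estimated p* = (tx, ty, theta, k)
```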

3.4 Experimental Analysis

3.4.1 Dataset and Experimental Setup

The proposed method was experimentally validated using data acquired in 2018 by Sentinel-1 (S1) and Sentinel-2 (S2) over Amazonia, north of Pozo Colorado, Paraguay. The area corresponds to the inner part of the Sentinel-2 granule 21KUQ (approx. 8000 × 8000 pixels). Two crops of this area are shown in the first two rows of Figure 3.5. As for the Sentinel-2 image, a false-color composition of the near-infrared (NIR), red, and green bands is used for visualization purposes.

Due to the different acquisition modalities, the Sentinel-1 data have been resampled onto a pixel grid with a spacing of 10 m, which is the same spatial resolution as the optical imagery. The SAR data used for the experiments (and visualized in the second row of Figure 3.5) were obtained from a time series of 7 S1 acquisitions using the multitemporal despeckling method in [162]. The S2 and the despeckled SAR images have been manually registered to be used for training the cGAN and testing the proposed method.

Concerning the cGAN training phase, the training set was composed of 187 patches (512 × 512 pixels each) drawn from the eastern area of the imaged scene. The pix2pix architecture described in Section 3.3.2 was used. It is worth recalling that the original pix2pix architecture was slightly modified by adopting filters with 5 × 5 pixel windows. This was done to account for the fact that, even if registered, the images could exhibit residual subpixel errors. The number of training epochs was set to 250. The training time was around 8 hours on a Tesla K80 GPU. The dictionary used for the optimization of the COBYLA hyperparameter was D = {20, 30, 40, 50, 60}, and λ was set according to the recommendations in the original cGAN paper [124].

3.4.2 Experimental Results and Comparison

The validation was twofold. First, on two test areas drawn from the western part of the scene and disjoint from all training patches (approx. 1400 × 1400 pixels each; see Figure 3.5), a semi-simulated data set was generated by applying a synthetic transformation to the SAR image. Then, the proposed method was applied to the resulting misregistered pair. In this case, a "ground truth transformation" existed and the root mean-square error (RMSE) [110] could be quantitatively evaluated. Second, an experiment with fully real data was conducted by applying the method to the registration of an additional SAR image of a third test area (partially overlapping with one of the two areas used in the semi-simulated case). In this experiment with fully real data, quantitative error figures could not be calculated, but the registration accuracy was qualitatively evaluated by visual inspection.

The second experiment, with a real unregistered dataset, was also meant to test the robustness of the proposed cGAN-based image-to-image translation task, since the involved additional SAR image was obtained by the multitemporal despeckling of S1 acquisitions collected in a different season with respect to that of the training data. Figure 3.5 shows the results of the cGAN-based image-to-image translation task. A visual comparison between the true and estimated SAR images points out the effectiveness of this architecture in emulating SAR data from input optical data, at least in the case of the considered data sets. Indeed, border effects between adjacent patches are barely visible.

Table 3.1 shows the RMSE and time obtained by the proposed method and by a state-of-the-art area-based approach. This method is based on the computation of the mutual information between the input and reference images. Indeed, the state-of-the-art solution does not require the domain adaptation stage that, in the proposed method, is accomplished by the GAN, yet it requires computing a similarity metric that is generally computationally heavy, thus affecting the computational time needed for convergence. The adopted transformation model is the rotation-scale-translation model, and the optimization is performed using either COBYLA or Powell's algorithm. Powell's algorithm is an unconstrained optimization method that emulates the conjugate gradient approach without using derivatives [130]. It was applied, together with barrier functions, in a formulation that enforced the same constraints on the search space as in the experiments with COBYLA.

To calculate the time needed for the different methods to register the input images, only the time required for the evaluation of the metric along all iterations of the minimization procedure was taken into account, without considering the warping time, which is a common overhead for all techniques.

Figure 3.5: Input image pairs and domain adaptation results. (a) Area 1, optical; (b) Area 2, optical; (c) Area 1, SAR; (d) Area 2, SAR; (e) Area 1, GAN-generated SAR; (f) Area 2, GAN-generated SAR.

Table 3.1: RMSE (in pixels) and computation time (in seconds).

Test area        Previous method (COBYLA)   Previous method (Powell)   Proposed method
RMSE, Area 1     2.342                      0.953                      0.264
RMSE, Area 2     0.902                      1.058                      0.266
RMSE, Avg.       1.622                      1.005                      0.265
Time             ~30 s                      ~50 s                      ~4 s

The proposed method obtained markedly subpixel errors on both test areas (RMSE ≈ 0.26 pixels), thus suggesting the effectiveness of integrating the cGAN with a correlation-type metric and using COBYLA as the minimization method. The state-of-the-art solution obtained an accurate result as well, yet with a substantially larger error than the proposed technique. Moreover, the time taken by the developed method, including cGAN prediction and hyperparameter optimization in COBYLA, was more than 10 times shorter.

In addition, Figure 3.6 shows the results of the proposed method by superimposing the images before and after registration. Two displays are used. In the first one, the true and estimated SAR images are overlaid in a false-color composite. In the second one, the SAR image and the NIR channel are composed in a checkerboard pattern whose squares come from the two images alternately. The left panel regards one of the two test areas in the semi-simulated case and confirms a visually negligible registration error. The right panel concerns the real data set on the third area. The comparison between the before- and after-registration displays visually points out that the developed method obtained a remarkable improvement in spatial alignment, thus confirming its effectiveness in the application to real data as well. Residual mismatch can be noted around certain edges and is interpreted as a rather minor impact of the different seasonality of the input images and of the resulting changes between them (e.g., shapes and sizes of crops, rivers, etc.).

Figure 3.6: Results of the proposed multisensor registration method. Columns: experiment with semi-simulated data on test area #2 (before and after registration) and experiment with fully real data on test area #3 (before and after registration). Top row: false-color overlay of the real and estimated SAR images (G = reference; R = B = input). Bottom row: checkerboard display of the SAR image and the NIR channel.

3.5 Conclusion

A novel registration method for multisensor optical-SAR images has been proposed by integrating a conditional GAN architecture within an area-based registration framework involving an ℓ2 similarity metric and the COBYLA optimization algorithm. The conventional area-based solutions to the problem of multisensor image registration usually make use of information-theoretic functionals to measure the similarity between the image pairs. By taking advantage of the domain adaptation capabilities of conditional adversarial networks, the need for computationally heavy metrics is relaxed, thus allowing the use of simpler and less demanding correlation-type similarity measures.

Experimental results with semi-simulated and real data suggested the capability of the method to obtain subpixel error and/or visually accurate results, while dramatically reducing the computational time required by state-of-the-art solutions to register the pairs of multisensor images. Moreover, the experimental analysis with pairs of real images also allowed assessing the robustness of the proposed solution to different acquisition conditions. Indeed, the proposed solution exhibited a rather small impact of seasonality issues and outperformed a previous area-based approach that used an information-theoretic metric.

Remarkably, a simple correlation-type metric, which is computationally efficient but usually ineffective in the application to multisensor image registration, has proven effective within the proposed approach thanks to the integration of the registration method with the powerful image translation capabilities of the cGAN architecture.

Among the possible future generalizations, it is possible to enumerate the extension to the use of cyclic GANs, so as to relax the constraint of having a registered dataset for the cGAN training phase. Indeed, the proposed solution requires the input image pairs to be registered in order to train the conditional GAN architecture. Conversely, a cyclic scheme is able to cope with unregistered pairs of images, thus allowing the application of such a correlation-type registration to the general case of multisensor registration, also in cases where no ground truth is available and manual image registration is not feasible. Another possibility is the integration of the registration metric, such as the correlation-type metric or the mutual information, within the loss function of the neural model. In this way, the newly generated imagery would already be optimized for the similarity metric used during the registration phase.
