Figure 1. Deep supervised domain adaptation. In training, the semantic alignment loss minimizes the distance between samples from different domains but the same class label and the separation loss maximizes the distance between samples from different domains and class labels. At the same time, the classification loss guarantees high classification accuracy.

# Abstract

This work provides a unified framework for addressing the problem of visual supervised domain adaptation and generalization with deep models. The main idea is to exploit the Siamese architecture to learn an embedding subspace that is discriminative, and where mapped visual domains are semantically aligned and yet maximally separated. The supervised setting becomes especially attractive when only a few target data samples need to be labeled. In this scenario, alignment and separation of semantic probability distributions are difficult because of the lack of data. We found that reverting to point-wise surrogates of distribution distances and similarities provides an effective solution.

# Method

We are given a training dataset made of pairs $\mathcal{D}_s = \{( x_i^s,y_i^s )\}_{i=1}^{N}$, where $x^s_i \in \mathcal{X}$ is a realization of a random variable $X^s$ and the label $y_i^s \in \mathcal{Y}$ is a realization of a random variable $Y$. In addition, we are given the training data $\mathcal{D}_t = \{( x_i^t,y_i^t )\}_{i=1}^{M}$, where $x^t_i \in \mathcal{X}$ is a realization of a random variable $X^t$ and $y_i^t \in \mathcal{Y}$ is its label. We say that $X^s$ represents the source domain and that $X^t$ represents the target domain. We assume that there is a covariate shift [1] between $X^s$ and $X^t$. Under this setting, the goal is to learn a prediction function $f : \mathcal{X} \rightarrow \mathcal{Y}$ that performs well at test time on data from the target domain.

The problem formulated thus far is typically referred to as supervised domain adaptation (SDA). In this work we are especially concerned with the version of this problem where only very few target labeled samples per class are available.

In general, $f$ could be modeled by the composition of two functions, i.e., $f = h \circ g$. Here $g : \mathcal{X} \rightarrow \mathcal{Z}$ would be an embedding from the input space $\mathcal{X}$ to a feature or embedding space $\mathcal{Z}$, and $h : \mathcal{Z} \rightarrow \mathcal{Y}$ would be a function for predicting from the feature space (Figure 1). With this notation we would have $f_s = h_s \circ g_s$ and $f_t = h_t \circ g_t$, and the SDA problem would be about finding the best approximation for $g_t$ and $h_t$, given the constraints on the available data. The unsupervised DA paradigm (UDA) assumes that $\mathcal{D}_t$ does not have labels. In that case the typical approach assumes that $g_t = g_s = g$, and $f$ minimizes

\begin{equation}
\mathcal{L}_C(f) = E[\ell(f(X^s), Y)] \label{eq:1}
\end{equation}

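where $E[\cdot]$ denotes statistical expectation and $\ell$ could be any appropriate loss function, while $g$ also minimizes

\begin{equation}
\mathcal{L}_{CA}(g) = d(p(g(X^s)), p(g(X^t))) \label{eq:2}
\end{equation}

where $d$ is a suitable distance metric between the distributions of $X^s$ and $X^t$ in the embedding space. The purpose of \eqref{eq:2} is to align the distributions of the features in the embedding space, mapped from the source and the target domains. We refer to \eqref{eq:2} as the confusion alignment loss $(\mbox{CA})$.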

Since we are interested in visual recognition, the embedding function $g$ would be modeled by a convolutional neural network (CNN) with some initial convolutional layers, followed by some fully connected layers. Given $g_s = g_t = g$, the CNN parameters would be shared as in a Siamese [2] architecture. In addition, the source stream would continue with additional fully connected layers for modeling $h$ (see Figure 1). To perform well, traditional methods need to align the distributions effectively, and this can happen only if each distribution is represented by a sufficiently large dataset. UDA approaches in particular are therefore in a position of weakness, because we assume $\mathcal{D}_t$ to be small. Moreover, even with perfect confusion alignment, UDA approaches do not guarantee that samples from different domains but with the same class label map nearby in the embedding space. SDA approaches easily address the semantic alignment problem by replacing \eqref{eq:2} with

\begin{equation}
\mathcal{L}_{SA}(g) = \sum_{a=1}^{C} d(p(g(X_a^s)), p(g(X_a^t))) \label{eq:3}
\end{equation}

where $C$ is the number of class labels, and $X_a^s = X^s | \{ Y=a \}$ and $X_a^t = X^t | \{ Y=a \}$ are conditional random variables. We refer to \eqref{eq:3} as the semantic alignment loss $(\mbox{SA})$, which clearly encourages samples from different domains but with the same label to map nearby in the embedding space.

The above analysis indicates why SDA outperforms UDA. It also suggests that deep SDA approaches have not yet exploited class separation: greater performance could be achieved by encouraging samples from different domains and with different labels to map as far apart as possible in the embedding space.

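This idea means that, in principle, a semantic alignment less prone to errors should be achieved by adding to \eqref{eq:3} the following term

\begin{equation}
\mathcal{L}_S(g) = \sum_{a,b \,|\, a \neq b} k(p(g(X_a^s)), p(g(X_b^t))) \label{eq:4}
\end{equation}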

where $k$ is a suitable similarity metric between the distributions of $X_a^s$ and $X_b^t$ in the embedding space, which adds a penalty when the distributions $p(g(X_a^s))$ and $p(g(X_b^t))$ come close, since that would lead to lower classification accuracy. We refer to \eqref{eq:4} as the separation loss $(\mbox{S})$.

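Finally, we suggest that SDA could be approached by learning a deep model $f = h \circ g$ that minimizes

\begin{equation}
\mathcal{L}_{CCSA}(f) = \mathcal{L}_C(h \circ g) + \mathcal{L}_{SA}(g) + \mathcal{L}_S(g) \label{eq:5}
\end{equation}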

We refer to \eqref{eq:5} as the classification and contrastive semantic alignment loss $(\mbox{CCSA})$. The classification network $h$ is trained only with source data, so $h_s = h$. In addition, to improve performance on the target domain, $h_t$ could be obtained via fine-tuning based on the few samples in $\mathcal{D}_t$, i.e.,

\begin{equation}
h_t = \mbox{fine-tuning}(h \,|\, \mathcal{D}_t) \label{eq:6}
\end{equation}

## Handling Scarce Target Data

When the size of the labeled target training dataset $\mathcal{D}_t$ is very small, minimizing the loss \eqref{eq:5} becomes a challenge. The problem is that the semantic alignment loss as well as the separation loss rely on computing distances and similarities between distributions, and those are very difficult to represent with as few as one data sample. Because of the reduced size of $\mathcal{D}_t$, rather than attempting to characterize distributions with statistics that require enough data, we compute the distance in the semantic alignment loss \eqref{eq:3} from pairwise distances between points in the embedding space, i.e., we compute

\begin{equation}
d(p(g(X_a^s)), p(g(X_a^t))) = \sum_{i,j} d(g(x_i^s), g(x_j^t)) \label{eq:7}
\end{equation}

where it is assumed that $y_i^s = y_j^t = a$. The strength of this approach is that it allows even a single labeled target sample to be paired with all the source samples, effectively trying to semantically align the entire source data with the few target data. Similarly, we compute the similarities in the separation loss \eqref{eq:4} from pairwise similarities between points in the embedding space, i.e., we compute

\begin{equation}
k(p(g(X_a^s)), p(g(X_b^t))) = \sum_{i,j} k(g(x_i^s), g(x_j^t)) \label{eq:8}
\end{equation}

where it is assumed that $y_i^s = a \ne y_j^t = b$.
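The two point-wise surrogates can be illustrated with a short Python sketch. This is not the authors' implementation: the function names, the choice of half the squared Euclidean distance for $d$, the margin-based hinge for $k$ (in the spirit of the contrastive loss used with Siamese networks), the margin value, and the averaging over pairs are all assumptions made for the example.

```python
import math

def point_distance(z_s, z_t):
    """Assumed d: half the squared Euclidean distance between two embeddings."""
    return 0.5 * math.dist(z_s, z_t) ** 2

def point_similarity(z_s, z_t, margin=1.0):
    """Assumed k: hinge penalty, positive only when the points are closer than `margin`."""
    return 0.5 * max(0.0, margin - math.dist(z_s, z_t)) ** 2

def pairwise_losses(source, target, margin=1.0):
    """Average the point-wise surrogates over all source/target pairs.

    `source` and `target` are lists of (embedding, label) tuples, where each
    embedding stands for g(x) computed beforehand by some embedding function.
    Returns (semantic_alignment, separation).
    """
    same_class, different_class = [], []
    for z_s, y_s in source:
        for z_t, y_t in target:
            if y_s == y_t:
                # Same label across domains: penalize distance.
                same_class.append(point_distance(z_s, z_t))
            else:
                # Different labels across domains: penalize closeness.
                different_class.append(point_similarity(z_s, z_t, margin))
    sa = sum(same_class) / len(same_class) if same_class else 0.0
    sep = sum(different_class) / len(different_class) if different_class else 0.0
    return sa, sep

# Toy example: two source embeddings and a single labeled target embedding.
source = [((0.0, 0.0), 0), ((2.0, 0.0), 1)]
target = [((0.0, 1.0), 0)]
sa, sep = pairwise_losses(source, target, margin=3.0)
```

Even with one labeled target sample, every source sample contributes a pair: those with the same label feed the alignment term, and the others feed the separation term.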

# Extension to Domain Generalization

Figure 2. Deep domain generalization. In training, the semantic alignment loss minimizes the distance between samples from different domains but the same class label and the separation loss maximizes the distance between samples from different domains and class labels. At the same time, the classification loss guarantees high classification accuracy. In testing, the embedding function embeds samples from unseen distributions to the domain invariant space and the prediction function classifies them $(\mbox{right})$. In this figure, different colors represent different domain distributions and different shapes represent different classes.

In visual domain generalization $(\mbox{DG})$, $D$ labeled datasets $D_{s_1},\dots,D_{s_D}$, representative of $D$ distinct source domains, are given. The goal is to learn from them a visual classifier $f$ that performs well at test time on data $D_t$ that is not available during training and is thus representative of an unknown target domain. In domain generalization, we are not interested in adapting the classifier to the target domain, because it is unknown. Instead, we want to make sure that the embedding $g$ maps to a domain invariant space. To do so we consider every distinct unordered pair of source domains $(u, v)$, represented by $D_{s_u}$ and $D_{s_v}$, and, as in SDA, impose the semantic alignment loss \eqref{eq:3} as well as the separation loss \eqref{eq:4} (Figure 2). The network architecture is still the one in Figure 1, and we have implemented it with the same choices for distances and similarities as those made in the Method section. However, since we are summing the losses \eqref{eq:3} and \eqref{eq:4} over every unordered pair of source domains, the number of paired training samples grows quadratically. So, if necessary, rather than processing every paired sample, we randomly select a subset of them.
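The pairing scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name, the `max_pairs_per_domain_pair` cap, and the toy domain names are hypothetical.

```python
import itertools
import random

def sample_cross_domain_pairs(domains, max_pairs_per_domain_pair, seed=0):
    """Build sample pairs for every unordered pair of source domains.

    `domains` maps a domain name to a list of (sample_id, label) tuples.
    Returns a list of (sample_id_u, sample_id_v, same_class) triples, where
    `same_class` tells whether the pair feeds the semantic alignment term
    (True) or the separation term (False).
    """
    rng = random.Random(seed)
    pairs = []
    for u, v in itertools.combinations(sorted(domains), 2):
        # All cross-domain pairs: grows with the product of the domain sizes.
        candidates = [(a, b) for a in domains[u] for b in domains[v]]
        if len(candidates) > max_pairs_per_domain_pair:
            # Random subsampling instead of processing every paired sample.
            candidates = rng.sample(candidates, max_pairs_per_domain_pair)
        for (id_u, y_u), (id_v, y_v) in candidates:
            pairs.append((id_u, id_v, y_u == y_v))
    return pairs

# Toy example with three hypothetical source domains and two classes each.
domains = {
    "s1": [("s1_0", 0), ("s1_1", 1)],
    "s2": [("s2_0", 0), ("s2_1", 1)],
    "s3": [("s3_0", 0), ("s3_1", 1)],
}
pairs = sample_cross_domain_pairs(domains, max_pairs_per_domain_pair=3)
all_pairs = sample_cross_domain_pairs(domains, max_pairs_per_domain_pair=10)
```

With three domains there are three unordered domain pairs, so the cap of 3 yields 9 sampled pairs, while the uncapped call keeps all 12.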

# Experiments

For more extensive and detailed results we invite the reader to consult the manuscript.

We present results using the Office dataset [3], the MNIST dataset [4], and the USPS dataset [5].

### Office Dataset

We consider six domain shifts using the three domains $(\mathcal{A} \rightarrow \mathcal{W}, \mathcal{A} \rightarrow \mathcal{D}, \mathcal{W} \rightarrow \mathcal{A}, \mathcal{W} \rightarrow \mathcal{D}, \mathcal{D} \rightarrow \mathcal{A}$, and $\mathcal{D} \rightarrow \mathcal{W})$.

In the first experiment we followed the setting described in Tzeng et al. [6]: we use all classes of the Office dataset with 5 train-test splits. For the source domain, 20 examples per category for the Amazon domain, and 8 examples per category for the DSLR and Webcam domains, are randomly selected for training for each split. Also, 3 labeled examples are randomly selected for each category in the target domain for training for each split. The rest of the target samples are used for testing. We report some results in Table 1.

\begin{array}{|c|c|c|c|c|c|c|c|}
\hline
\mbox{Groups} & \mbox{Lower Bound} & \mbox{Tzeng 2014} & \mbox{Long 2015} & \mbox{Ghifary 2016} & \mbox{Tzeng 2015} & \mbox{Koniusz 2017} & \mathbf{CCSA} \\
\hline
\mathcal{A} \rightarrow \mathcal{W} & 61.2 \pm 0.9 & 61.8 \pm 0.4 & 68.5 \pm 0.4 & 68.7 \pm 0.3 & 82.7 \pm 0.8 & 84.5 \pm 1.7 & \mathbf{88.2 \pm 1.0} \\
\mathcal{A} \rightarrow \mathcal{D} & 62.3 \pm 0.8 & 64.4 \pm 0.3 & 67.0 \pm 0.4 & 67.1 \pm 0.3 & 86.1 \pm 1.2 & 86.3 \pm 0.8 & \mathbf{89.0 \pm 1.2} \\
\mathcal{W} \rightarrow \mathcal{A} & 51.6 \pm 0.9 & 52.2 \pm 0.4 & 53.1 \pm 0.3 & 54.09 \pm 0.5 & 65.0 \pm 0.5 & 65.7 \pm 1.7 & \mathbf{72.1 \pm 1.0} \\
\mathcal{W} \rightarrow \mathcal{D} & 95.6 \pm 0.7 & 98.5 \pm 0.4 & \mathbf{99.0 \pm 0.2} & \mathbf{99.0 \pm 0.2} & 97.6 \pm 0.2 & 97.5 \pm 0.7 & 97.6 \pm 0.4 \\
\mathcal{D} \rightarrow \mathcal{A} & 58.5 \pm 0.8 & 52.1 \pm 0.8 & 54.0 \pm 0.4 & 56.0 \pm 0.5 & 66.2 \pm 0.3 & 66.5 \pm 1.0 & \mathbf{71.8 \pm 0.5} \\
\mathcal{D} \rightarrow \mathcal{W} & 80.1 \pm 0.6 & 95.0 \pm 0.5 & 96.0 \pm 0.3 & \mathbf{96.4 \pm 0.3} & 95.7 \pm 0.5 & 95.5 \pm 0.6 & \mathbf{96.4 \pm 0.8} \\
\hline
\mbox{Average} & 68.2 & 70.6 & 72.9 & 73.6 & 82.21 & 82.68 & \mathbf{85.8} \\
\hline
\end{array}

Table 1. Office dataset. Classification accuracy for domain adaptation over the 31 categories of the Office dataset. A, W, and D stand for the Amazon, Webcam, and DSLR domains, respectively. Lower Bound is our base model without adaptation.
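The sampling protocol of the first experiment can be sketched in Python as below. This is only an illustration: the helper names, the seeds, and the toy sample lists (30 source and 10 target images per category) are hypothetical and do not reflect the real Office dataset sizes.

```python
import random
from collections import defaultdict

def sample_per_class(samples, per_class, rng):
    """Randomly pick `per_class` items from each category; return (picked, rest)."""
    by_class = defaultdict(list)
    for item in samples:
        by_class[item[1]].append(item)
    picked, rest = [], []
    for label in sorted(by_class):
        items = by_class[label][:]
        rng.shuffle(items)
        picked.extend(items[:per_class])
        rest.extend(items[per_class:])
    return picked, rest

def make_split(source, target, n_source, n_target=3, seed=0):
    """One train-test split: labeled source and target training sets, target test set."""
    rng = random.Random(seed)
    source_train, _ = sample_per_class(source, n_source, rng)
    target_train, target_test = sample_per_class(target, n_target, rng)
    return source_train, target_train, target_test

# Toy stand-ins for an Amazon -> Webcam task with 31 categories.
source = [(f"a_{c}_{i}", c) for c in range(31) for i in range(30)]
target = [(f"w_{c}_{i}", c) for c in range(31) for i in range(10)]

# Five splits, each with 20 source and 3 target training examples per category.
splits = [make_split(source, target, n_source=20, n_target=3, seed=s) for s in range(5)]
src_train, tgt_train, tgt_test = splits[0]
```

For a DSLR or Webcam source domain, `n_source` would be 8 instead of 20, following the protocol described above.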

In the second experiment we followed the setting described in [6], where only 10 labeled target samples of 15 classes of the Office dataset are available during training. As in [6], we compute the accuracy on the remaining 16 categories for which no target data was available during training.

In the third experiment we used the original train-test splits of the Office dataset [3]. The splits are generated in a similar manner to the first experiment, but here only 10 classes are considered: backpack, bike, calculator, headphones, keyboard, laptop-computer, monitor, mouse, mug, and projector. In order to compare our results with the state of the art, we used DeCaF-fc6 features [7] and 800-dimension SURF features as input. For DeCaF-fc6 features we used 2 fully connected layers with output size of 1024 (512) and 128 (32) with ReLU activation as the embedding function, and one fully connected layer with softmax activation as the prediction function. The features and splits are available on the Office dataset webpage.

### MNIST-USPS Dataset

The MNIST $(\mathcal{M})$ and USPS $(\mathcal{U})$ datasets contain images of digits from 0 to 9. We considered two cross-domain tasks, $\mathcal{M} \rightarrow \mathcal{U}$ and $\mathcal{U} \rightarrow \mathcal{M}$, and followed a previously used experimental setting, which involves randomly selecting 2000 images from MNIST and 1800 images from USPS.

As Figure 3 (left) shows, the raw images of the same class but different domains lie far from each other in the 2D subspace, whereas Figure 3 (right) shows the solution obtained with the SDA method.

## Domain Generalization

In domain generalization the goal is to show that CCSA is able to learn a domain invariant embedding subspace for visual recognition tasks. We tested on two well-known datasets: VLCS and MNIST.

For VLCS, we use images of 5 shared object categories (bird, car, chair, dog, and person) from the PASCAL VOC2007 $(\mathcal{V})$, LabelMe $(\mathcal{L})$, Caltech-101 $(\mathcal{C})$, and SUN09 $(\mathcal{S})$ datasets, which together are known as the VLCS dataset [8]. Our DG method has higher average performance. Also, note that in order to compare with the state-of-the-art DG methods, we only used 2 fully connected layers for our network and precomputed features as input.

For the MNIST dataset experiment, we followed the setting in [9] and randomly selected a set M of 100 images per category from the MNIST dataset, 1000 in total. We then rotated each image in M five times at 15-degree intervals, creating five new domains. We conducted a leave-one-domain-out evaluation (6 cross-domain cases in total). We used the same network of Section 5.1.2 of the manuscript, and we repeated the experiments 10 times. The average accuracies we obtain for CCSA are comparable with those of the state-of-the-art methods.
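The leave-one-domain-out protocol for the rotated MNIST domains can be enumerated with a few lines of Python. This is only a sketch: the concrete angle list assumes the five rotations are 15, 30, 45, 60, and 75 degrees, with the unrotated set as the sixth domain, and the names are hypothetical.

```python
# Assumed domains: the original set M (0 degrees) plus five rotated copies.
ANGLES = [0, 15, 30, 45, 60, 75]

def leave_one_domain_out(angles):
    """Yield (train_angles, test_angle) for each cross-domain case."""
    for held_out in angles:
        train = [a for a in angles if a != held_out]
        yield train, held_out

cases = list(leave_one_domain_out(ANGLES))
for train_angles, test_angle in cases:
    print(f"train on {train_angles}, test on {test_angle}")
```

Each case trains on five domains and tests on the held-out one, which gives the 6 cross-domain cases mentioned above.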

#### References

1. H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

2. S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 539–546, 2005.

3. K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, pages 213–226, 2010.

4. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

5. J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on pattern analysis and machine intelligence, 16(5):550–554, 1994.

6. E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.

7. J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: a deep convolutional activation feature for generic visual recognition. arXiv:1310.1531, 2013.

8. C. Fang, Y. Xu, and D. N. Rockmore. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In International Conference on Computer Vision, 2013.

9. M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pages 2551–2559, 2015.