Latent space domain transfer between high dimensional overlapping distributions

Sihong Xie, Wei Fan, Jing Peng, Olivier Verscheure, Jiangtao Ren

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review

23 Citations (Scopus)

Abstract

Transferring knowledge from one domain to another is challenging for a number of reasons. Since both the conditional and marginal distributions of the training and test data are non-identical, a model trained in one domain is usually low in accuracy when directly applied to a different domain. In many applications with large feature sets, such as text documents, sequence data, medical data, and image data of different resolutions, the two domains usually do not contain exactly the same features, introducing large numbers of "missing values" when considered over the union of features from both domains. In other words, their marginal distributions are at most overlapping. At the same time, these problems are usually high dimensional, often with several thousand features. Thus, the combination of high dimensionality and missing values makes the relationship between the conditional probabilities of the two domains hard to measure and model. To address these challenges, we propose a framework that first brings the marginal distributions of the two domains closer by "filling up" the missing values of disjoint features. Afterwards, it looks for comparable sub-structures in the "latent space" mapped from the expanded feature vector, where both marginal and conditional distributions are similar. With these sub-structures in latent space, the proposed approach then finds common concepts that are transferable across domains with high probability. During prediction, unlabeled instances are treated as "queries": the most related labeled instances from the out-domain are retrieved, and the classification is made by weighted voting over the retrieved out-domain examples.
We formally show that importing feature values across domains and latent semantic indexing jointly make the distributions of the two related domains easier to measure than in the original feature space, and that the nearest-neighbor method employed to retrieve related out-domain examples has bounded error when predicting in-domain examples. Software and datasets are available for download. Copyright is held by the International World Wide Web Conference Committee (IW3C2).
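The prediction step the abstract describes — map both domains into a shared latent space, then classify each in-domain "query" by distance-weighted voting over its nearest labeled out-domain neighbors — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes missing values over the union feature space have already been filled (here, with zeros), uses a plain truncated SVD as the latent mapping, and the function name, toy data, and parameters `k` and `dim` are hypothetical.

```python
import numpy as np

def latent_space_knn_predict(X_out, y_out, X_in, k=3, dim=2):
    """Classify in-domain rows X_in by weighted kNN voting over labeled
    out-domain rows X_out, after projecting both into a shared latent space."""
    # Stack both domains over the union feature space (missing values already
    # filled with zeros), then map to a latent space via truncated SVD.
    X = np.vstack([X_out, X_in])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = X @ Vt[:dim].T                       # latent coordinates
    Z_out, Z_in = Z[:len(X_out)], Z[len(X_out):]
    preds = []
    for q in Z_in:                           # each unlabeled instance is a "query"
        d = np.linalg.norm(Z_out - q, axis=1)
        nn = np.argsort(d)[:k]               # k nearest out-domain examples
        w = 1.0 / (d[nn] + 1e-9)             # closer neighbors vote more
        votes = {}
        for i, wi in zip(nn, w):
            votes[y_out[i]] = votes.get(y_out[i], 0.0) + wi
        preds.append(max(votes, key=votes.get))
    return preds

# Toy illustration (hypothetical data): two out-domain classes over a
# 4-feature union vocabulary; two in-domain queries.
X_out = np.array([[1.0, 1.0, 0.0, 0.0],
                  [1.0, 0.9, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0, 0.9]])
y_out = [0, 0, 1, 1]
X_in = np.array([[0.9, 1.0, 0.1, 0.0],
                 [0.0, 0.1, 0.9, 1.0]])
preds = latent_space_knn_predict(X_out, y_out, X_in, k=2, dim=2)
```

Because both domains are projected with the same latent basis, distances between in-domain queries and out-domain examples become comparable even when their original feature sets only partially overlap.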

Original language: English
Title of host publication: WWW'09 - Proceedings of the 18th International World Wide Web Conference
Pages: 91-100
Number of pages: 10
DOIs: 10.1145/1526709.1526723
State: Published - 1 Dec 2009
Event: 18th International World Wide Web Conference, WWW 2009 - Madrid, Spain
Duration: 20 Apr 2009 → 24 Apr 2009

Publication series

Name: WWW'09 - Proceedings of the 18th International World Wide Web Conference

Other

Other: 18th International World Wide Web Conference, WWW 2009
Country: Spain
City: Madrid
Period: 20/04/09 → 24/04/09


Keywords

  • Algorithms

Cite this

Xie, S., Fan, W., Peng, J., Verscheure, O., & Ren, J. (2009). Latent space domain transfer between high dimensional overlapping distributions. In WWW'09 - Proceedings of the 18th International World Wide Web Conference (pp. 91-100). (WWW'09 - Proceedings of the 18th International World Wide Web Conference). https://doi.org/10.1145/1526709.1526723
Xie, Sihong ; Fan, Wei ; Peng, Jing ; Verscheure, Olivier ; Ren, Jiangtao. / Latent space domain transfer between high dimensional overlapping distributions. WWW'09 - Proceedings of the 18th International World Wide Web Conference. 2009. pp. 91-100 (WWW'09 - Proceedings of the 18th International World Wide Web Conference).
@inproceedings{d80d7853026649f8bc9f2ad6314f8353,
title = "Latent space domain transfer between high dimensional overlapping distributions",
abstract = "Transferring knowledge from one domain to another is challenging for a number of reasons. Since both the conditional and marginal distributions of the training and test data are non-identical, a model trained in one domain is usually low in accuracy when directly applied to a different domain. In many applications with large feature sets, such as text documents, sequence data, medical data, and image data of different resolutions, the two domains usually do not contain exactly the same features, introducing large numbers of {"}missing values{"} when considered over the union of features from both domains. In other words, their marginal distributions are at most overlapping. At the same time, these problems are usually high dimensional, often with several thousand features. Thus, the combination of high dimensionality and missing values makes the relationship between the conditional probabilities of the two domains hard to measure and model. To address these challenges, we propose a framework that first brings the marginal distributions of the two domains closer by {"}filling up{"} the missing values of disjoint features. Afterwards, it looks for comparable sub-structures in the {"}latent space{"} mapped from the expanded feature vector, where both marginal and conditional distributions are similar. With these sub-structures in latent space, the proposed approach then finds common concepts that are transferable across domains with high probability. During prediction, unlabeled instances are treated as {"}queries{"}: the most related labeled instances from the out-domain are retrieved, and the classification is made by weighted voting over the retrieved out-domain examples.
We formally show that importing feature values across domains and latent semantic indexing jointly make the distributions of the two related domains easier to measure than in the original feature space, and that the nearest-neighbor method employed to retrieve related out-domain examples has bounded error when predicting in-domain examples. Software and datasets are available for download. Copyright is held by the International World Wide Web Conference Committee (IW3C2).",
keywords = "Algorithms",
author = "Sihong Xie and Wei Fan and Jing Peng and Olivier Verscheure and Jiangtao Ren",
year = "2009",
month = "12",
day = "1",
doi = "10.1145/1526709.1526723",
language = "English",
isbn = "9781605584874",
series = "WWW'09 - Proceedings of the 18th International World Wide Web Conference",
pages = "91--100",
booktitle = "WWW'09 - Proceedings of the 18th International World Wide Web Conference",

}

Xie, S, Fan, W, Peng, J, Verscheure, O & Ren, J 2009, Latent space domain transfer between high dimensional overlapping distributions. in WWW'09 - Proceedings of the 18th International World Wide Web Conference. WWW'09 - Proceedings of the 18th International World Wide Web Conference, pp. 91-100, 18th International World Wide Web Conference, WWW 2009, Madrid, Spain, 20/04/09. https://doi.org/10.1145/1526709.1526723

Latent space domain transfer between high dimensional overlapping distributions. / Xie, Sihong; Fan, Wei; Peng, Jing; Verscheure, Olivier; Ren, Jiangtao.

WWW'09 - Proceedings of the 18th International World Wide Web Conference. 2009. p. 91-100 (WWW'09 - Proceedings of the 18th International World Wide Web Conference).


TY - GEN

T1 - Latent space domain transfer between high dimensional overlapping distributions

AU - Xie, Sihong

AU - Fan, Wei

AU - Peng, Jing

AU - Verscheure, Olivier

AU - Ren, Jiangtao

PY - 2009/12/1

Y1 - 2009/12/1

N2 - Transferring knowledge from one domain to another is challenging for a number of reasons. Since both the conditional and marginal distributions of the training and test data are non-identical, a model trained in one domain is usually low in accuracy when directly applied to a different domain. In many applications with large feature sets, such as text documents, sequence data, medical data, and image data of different resolutions, the two domains usually do not contain exactly the same features, introducing large numbers of "missing values" when considered over the union of features from both domains. In other words, their marginal distributions are at most overlapping. At the same time, these problems are usually high dimensional, often with several thousand features. Thus, the combination of high dimensionality and missing values makes the relationship between the conditional probabilities of the two domains hard to measure and model. To address these challenges, we propose a framework that first brings the marginal distributions of the two domains closer by "filling up" the missing values of disjoint features. Afterwards, it looks for comparable sub-structures in the "latent space" mapped from the expanded feature vector, where both marginal and conditional distributions are similar. With these sub-structures in latent space, the proposed approach then finds common concepts that are transferable across domains with high probability. During prediction, unlabeled instances are treated as "queries": the most related labeled instances from the out-domain are retrieved, and the classification is made by weighted voting over the retrieved out-domain examples.
We formally show that importing feature values across domains and latent semantic indexing jointly make the distributions of the two related domains easier to measure than in the original feature space, and that the nearest-neighbor method employed to retrieve related out-domain examples has bounded error when predicting in-domain examples. Software and datasets are available for download. Copyright is held by the International World Wide Web Conference Committee (IW3C2).

AB - Transferring knowledge from one domain to another is challenging for a number of reasons. Since both the conditional and marginal distributions of the training and test data are non-identical, a model trained in one domain is usually low in accuracy when directly applied to a different domain. In many applications with large feature sets, such as text documents, sequence data, medical data, and image data of different resolutions, the two domains usually do not contain exactly the same features, introducing large numbers of "missing values" when considered over the union of features from both domains. In other words, their marginal distributions are at most overlapping. At the same time, these problems are usually high dimensional, often with several thousand features. Thus, the combination of high dimensionality and missing values makes the relationship between the conditional probabilities of the two domains hard to measure and model. To address these challenges, we propose a framework that first brings the marginal distributions of the two domains closer by "filling up" the missing values of disjoint features. Afterwards, it looks for comparable sub-structures in the "latent space" mapped from the expanded feature vector, where both marginal and conditional distributions are similar. With these sub-structures in latent space, the proposed approach then finds common concepts that are transferable across domains with high probability. During prediction, unlabeled instances are treated as "queries": the most related labeled instances from the out-domain are retrieved, and the classification is made by weighted voting over the retrieved out-domain examples.
We formally show that importing feature values across domains and latent semantic indexing jointly make the distributions of the two related domains easier to measure than in the original feature space, and that the nearest-neighbor method employed to retrieve related out-domain examples has bounded error when predicting in-domain examples. Software and datasets are available for download. Copyright is held by the International World Wide Web Conference Committee (IW3C2).

KW - Algorithms

UR - http://www.scopus.com/inward/record.url?scp=77954575574&partnerID=8YFLogxK

U2 - 10.1145/1526709.1526723

DO - 10.1145/1526709.1526723

M3 - Conference contribution

SN - 9781605584874

T3 - WWW'09 - Proceedings of the 18th International World Wide Web Conference

SP - 91

EP - 100

BT - WWW'09 - Proceedings of the 18th International World Wide Web Conference

ER -

Xie S, Fan W, Peng J, Verscheure O, Ren J. Latent space domain transfer between high dimensional overlapping distributions. In WWW'09 - Proceedings of the 18th International World Wide Web Conference. 2009. p. 91-100. (WWW'09 - Proceedings of the 18th International World Wide Web Conference). https://doi.org/10.1145/1526709.1526723