Visual object tracking in video can be formulated as a time-varying, appearance-based binary classification problem. Tracking algorithms must adapt to changes in both the foreground object's appearance and the scene background. Fusing information from multimodal features (views or representations) can enhance classification performance, but naively concatenating image features into a high-dimensional vector increases classifier complexity. How to combine these representative views to effectively exploit multimodal information for classification therefore becomes a key issue. We show that the Kullback-Leibler (KL) divergence provides a framework that leads to a family of fusion techniques, including the Chernoff distance and the variance ratio, the latter being equivalent to linear discriminant analysis. We provide experimental results that corroborate our theoretical analysis.
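As an illustrative sketch (not part of the paper), the divergence measures named above all have closed forms when the two class-conditional densities are modeled as univariate Gaussians; the function names and the Gaussian assumption here are ours.

```python
import math

def kl_gauss(mu0, var0, mu1, var1):
    """KL divergence D(N(mu0, var0) || N(mu1, var1)) between univariate Gaussians."""
    return 0.5 * (math.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def chernoff_gauss(mu0, var0, mu1, var1, s=0.5):
    """Chernoff distance between univariate Gaussians at exponent s in (0, 1).

    At s = 0.5 this reduces to the Bhattacharyya distance.
    """
    v = s * var1 + (1.0 - s) * var0  # interpolated variance
    return (s * (1.0 - s) * (mu0 - mu1) ** 2) / (2.0 * v) \
        + 0.5 * math.log(v / (var0 ** (1.0 - s) * var1 ** s))

def variance_ratio(mu0, var0, mu1, var1):
    """Two-class Fisher-style variance ratio: between-class over within-class scatter.

    This is the 1-D criterion maximized by linear discriminant analysis.
    """
    return (mu0 - mu1) ** 2 / (var0 + var1)
```

For example, two foreground/background feature histograms summarized by their Gaussian moments can be scored with any of the three measures; a larger value indicates a more discriminative feature for the tracker.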