Most representation learning algorithms for language and image processing are local, in that they identify features for a data point based on surrounding points. Yet in language processing, the correct meaning of a word often depends on its global context. As a step toward incorporating global context into representation learning, we develop a representation learning algorithm that incorporates joint prediction into its technique for producing features for a word. We develop efficient variational methods for learning Factorial Hidden Markov Models from large texts, and use variational distributions to produce features for each word that are sensitive to the entire input sequence, not just to a local context window. Experiments on part-of-speech tagging and chunking indicate that the features are competitive with or better than existing state-of-the-art representation learning methods.