In today's information and computational society, complex systems are often modeled as multi-modal networks associated with heterogeneous structural relation, unstructured attribute/content, temporal context, or their combinations. The abundant information in multi-modal network requires both a domain understanding and large exploratory search space when doing feature engineering for building customized intelligent solutions in response to different purposes. Therefore, automating the feature discovery through representation learning in multi-modal networks has become essential for many applications. In this tutorial, we systematically review the area of multi-modal network representation learning, including a series of recent methods and applications. These methods will be categorized and introduced in the perspectives of unsupervised, semi-supervised and supervised learning, with corresponding real applications respectively. In the end, we conclude the tutorial and raise open discussions. The authors of this tutorial are active and productive researchers in this area.