Understanding indoor scene structure from a single RGB image is useful for a wide variety of applications, ranging from scene editing to mining statistics about space utilization. Most efforts in scene understanding focus on extracting either dense information, such as pixel-level depth or semantic labels, or very sparse information, such as bounding boxes obtained through object detection. In this paper, we propose the concept of a scene map, a coarse scene representation that describes the locations of the objects present in the scene from a top-down view (i.e., as they are positioned on the floor), together with a pipeline to extract such a map from a single RGB image. To train this pipeline, we use synthetic rendering, which supplies an adapted CNN with virtually unlimited training data. We quantitatively evaluate our results, showing that our approach clearly outperforms a dense baseline, and argue that scene maps provide a useful representation for abstract indoor scene understanding.
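
To make the idea of a scene map concrete, the following minimal sketch rasterizes known object positions on the floor into a coarse top-down grid with one binary layer per object class. It only illustrates the representation, not the extraction pipeline. The grid layout, the function name `make_scene_map`, the parameters `grid_size` and `floor_extent`, and the example classes are all assumptions made for this illustration and are not taken from the paper.

```python
def make_scene_map(objects, classes, grid_size=8, floor_extent=4.0):
    """Rasterize object floor positions into a coarse top-down grid.

    objects: list of (class_name, x, z) tuples, with x and z given in metres
             on the floor plane (hypothetical convention for this sketch).
    classes: list of class names; one grid layer is created per class.
    Returns a nested list indexed as scene_map[class][row][col], where a
    value of 1 means an object of that class lies in that floor cell.
    """
    # One grid_size x grid_size layer of zeros per class.
    scene_map = [[[0] * grid_size for _ in range(grid_size)] for _ in classes]
    cell = floor_extent / grid_size  # side length of one grid cell in metres

    for name, x, z in objects:
        # Ignore objects that fall outside the mapped floor area.
        if not (0.0 <= x < floor_extent and 0.0 <= z < floor_extent):
            continue
        col = int(x // cell)
        row = int(z // cell)
        scene_map[classes.index(name)][row][col] = 1
    return scene_map


# Example: two object classes in a 4 m x 4 m room, mapped to an 8 x 8 grid.
classes = ["chair", "table"]
objects = [("chair", 0.2, 0.3), ("table", 3.9, 2.0), ("chair", 5.0, 1.0)]
scene_map = make_scene_map(objects, classes)
```

In this sketch the third object lies outside the mapped floor area and is therefore dropped. A coarse grid of this kind sits between dense per-pixel outputs and sparse bounding boxes, which is the trade-off the scene map is meant to address.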