Much work has been done on the question of how the visual system extracts the three-dimensional (3D) structure and motion of an object from two-dimensional (2D) motion information, a problem known as ‘Structure from Motion’, or SFM. Much less is known, however, about the human ability to recover structure and motion when the optic flow field arises from multiple objects, although observations of this ability date back at least to Ullman’s well-known two-cylinders stimulus [The interpretation of visual motion (1979)]. In the presence of multiple objects, the SFM problem is compounded by the need to solve the segmentation problem, i.e. to decide which motion signal belongs to which object. Here, we present a model of how the human visual system solves the combined SFM and segmentation problems, which we term SSFM, concurrently. The model is based on the computation of a simple scalar property of the optic flow field known as def, which was previously shown to be used by human observers in SFM. The def values of many triplets of moving dots are computed, and the identification of multiple objects in the image is based on detecting multiple peaks in the histogram of def values. In five experiments, we show that human SSFM performance is consistent with the predictions of the model. We compare the predictions of our model to those of other theoretical approaches, in particular those that use a rigidity hypothesis, and discuss the validity of each approach as a model of human SSFM.
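As a rough illustration only (not the authors' implementation), the triplet-based def computation described above might be sketched as follows. The affine fit per triplet, the random-sampling scheme, and the function names are all assumptions; def is taken here as the deformation invariant of the local velocity-gradient field, combining its two shear components.

```python
import numpy as np

def triplet_def(positions, velocities):
    """Fit an affine velocity field u = a0 + a1*x + a2*y, v = b0 + b1*x + b2*y
    to three dots and return the def (deformation) magnitude of its gradient.
    (Illustrative sketch; the affine-fit formulation is an assumption.)"""
    A = np.column_stack([np.ones(3), positions[:, 0], positions[:, 1]])
    a = np.linalg.solve(A, velocities[:, 0])  # horizontal flow coefficients
    b = np.linalg.solve(A, velocities[:, 1])  # vertical flow coefficients
    # def combines the two shear components of the velocity gradient.
    return np.hypot(a[1] - b[2], a[2] + b[1])

def def_histogram(positions, velocities, n_triplets=2000, bins=40, seed=None):
    """Sample random dot triplets and histogram their def values; multiple
    peaks in this histogram would signal multiple objects in the display."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_triplets):
        idx = rng.choice(len(positions), size=3, replace=False)
        try:
            vals.append(triplet_def(positions[idx], velocities[idx]))
        except np.linalg.LinAlgError:
            continue  # collinear triplet: the affine fit is degenerate
    return np.histogram(vals, bins=bins)
```

For a single dot triplet moving under the pure shear field u = 0.5x, v = −0.5y, `triplet_def` returns 1.0, while a stationary or translating triplet yields 0; two objects with different deformation fields would thus populate separate bins of the histogram.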