Many visual tasks are carried out using multiple sources of sensory information to estimate environmental properties. In this paper, we present a model of how the visual system combines disparity and velocity information. We propose that, in a first stage of processing, the best possible estimate of affine structure is obtained by computing a composite score from the disparity and velocity signals. In a second stage, a maximum-likelihood Euclidean interpretation is assigned to the recovered affine structure. In two experiments, we show that human performance is consistent with the predictions of our model. We also discuss the present results in the framework of an alternative theoretical approach to depth-cue combination, termed Modified Weak Fusion.