Wide-area indoor people detection in a network of depth sensors is the basis for many applications, e.g. people counting or customer behavior analysis. Existing probabilistic methods use approximative stochastic inference to estimate the marginal probability distribution of people present in the scene for a single time step. In this work we investigate how the temporal context, given by a time series of multi-view depth observations, can be exploited to regularize a mean-field variational inference optimization process. We present a probabilistic grid based dynamic model and deduce the corresponding mean-field update regulations to effectively approximate the joint probability distribution of people present in the scene across space and time. Our experiments show that the proposed temporal regularization leads to a more robust estimation of the desired probability distribution and increases the detection performance.