Recently, interest in video-based interface techniques has been increasing, as they allow more natural interaction between users and systems than common interface devices do. Here, we present a neural architecture for user localisation, embedded within a complex system for vision-based human-machine interaction (HMI). User localisation is an absolute prerequisite for video-based HMI. Because the main objective is the greatest possible robustness of the localisation, and of the visual interface as a whole, under highly varying environmental conditions, we propose a multiple-cue approach. This approach combines the features facial structure, head-shoulder contour, skin color, and motion with a multiscale representation. The image region most likely to contain a user is then selected via a winner-take-all (WTA) process within the multiscale representation. Preliminary results show the reliability of the multiple-cue approach.
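To make the selection step concrete, the following minimal sketch illustrates one way that several cue maps could be fused at each scale and a single winning region picked across scales. It is not the architecture presented here: the function names `combine_cues` and `winner_take_all`, the cue weights, and the toy map values are all hypothetical, the fusion is assumed to be a simple weighted average, and the WTA process is replaced by a plain global maximum search rather than a neural competition.

```python
from typing import Dict, List, Tuple

Map = List[List[float]]  # one 2-D cue or saliency map with values in [0, 1]


def combine_cues(cue_maps: Dict[str, Map], weights: Dict[str, float]) -> Map:
    """Weighted average of same-sized cue maps at a single scale (assumed fusion rule)."""
    names = list(cue_maps)
    total = sum(weights[n] for n in names)
    rows = len(cue_maps[names[0]])
    cols = len(cue_maps[names[0]][0])
    return [
        [sum(weights[n] * cue_maps[n][r][c] for n in names) / total for c in range(cols)]
        for r in range(rows)
    ]


def winner_take_all(pyramid: List[Map]) -> Tuple[int, int, int, float]:
    """Return (scale, row, col, value) of the global maximum over all scales.

    A plain argmax stands in for the WTA competition here.
    """
    best = (-1, -1, -1, float("-inf"))
    for s, saliency in enumerate(pyramid):
        for r, row in enumerate(saliency):
            for c, value in enumerate(row):
                if value > best[3]:
                    best = (s, r, c, value)
    return best


# Hypothetical toy data: two cues at two scales (fine 2x2, coarse 1x1).
weights = {"face": 1.0, "skin": 1.0}
fine = combine_cues(
    {"face": [[0.1, 0.8], [0.2, 0.1]], "skin": [[0.2, 0.6], [0.1, 0.1]]}, weights
)
coarse = combine_cues({"face": [[0.4]], "skin": [[0.5]]}, weights)
scale, row, col, value = winner_take_all([fine, coarse])
```

In this toy case the winner is the cell where both cues respond strongly at the fine scale. A full system would add the head-shoulder contour and motion cues in the same way, under the same assumptions.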