The paper highlights the ever-increasing complexity that has arisen in the evaluation of IR systems over the last decade. Relevance, cognition, user behaviour, interaction, and a changing view of the boundaries of the system are considered to be contributory factors. Issues such as laboratory versus operational systems, black-box versus diagnostic experiments, and qualitative and quantitative methods are discussed and supported by examples drawn from three groups of evaluative experiments: weighted searching on a front-end system, information-seeking behaviour and the use of OPACs, and the OKAPI experimental retrieval system.