Affective states play a crucial role in learning, yet existing Intelligent Tutoring Systems (ITSs) fail to track learners' affective states accurately. Without accurate detection of such states, ITSs are limited in their ability to provide a truly personalized learning experience. In our longitudinal research, we have been working towards an empathic autonomous 'tutor' that closely monitors students in real time, using multiple sources of data to understand the affective states corresponding to their emotional engagement. We focus on detecting learning-related states (i.e., 'Satisfied', 'Bored', and 'Confused'). We have collected 210 hours of data through authentic classroom pilots of 17 sessions. The data come from two modalities: (1) appearance, which is collected from the camera, and (2) context-performance, which is derived from the content platform. The learning content on the platform consists of two section types: (1) instructional, where students watch instructional videos, and (2) assessment, where students solve exercise questions. Because individuals differ in how they express affective states, the detection of emotional engagement needs to be customized for each individual. In this paper, we propose a hierarchical semi-supervised model adaptation method to achieve highly accurate emotional engagement detectors. In the initial calibration phase, a personalized context-performance classifier is obtained. In the online usage phase, the appearance classifier is automatically personalized using the labels generated by the context-performance model. The experimental results show that personalization improves the performance of our generic emotional engagement detectors. The proposed semi-supervised hierarchical personalization method results in F1 measures of 89.23% and 75.20% for the instructional and assessment sections, respectively.
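
The two-phase flow described above (calibrate a personalized context-performance classifier, then use its predictions as labels to adapt the appearance classifier) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy nearest-centroid classifier, the function names `calibrate` and `personalize_online`, and the one-dimensional features are hypothetical and are not the classifiers or features used in this work.

```python
class NearestCentroid:
    """Toy stand-in classifier (assumption): predicts the label whose
    running feature centroid is closest to the input."""

    def __init__(self):
        self.sums = {}    # label -> per-feature running sums
        self.counts = {}  # label -> number of samples seen for that label

    def update(self, features, label):
        # Incrementally fold one (features, label) sample into the centroid.
        if label not in self.sums:
            self.sums[label] = [0.0] * len(features)
            self.counts[label] = 0
        self.sums[label] = [s + x for s, x in zip(self.sums[label], features)]
        self.counts[label] += 1

    def predict(self, features):
        def squared_distance(label):
            n = self.counts[label]
            return sum((x - s / n) ** 2
                       for x, s in zip(features, self.sums[label]))
        return min(self.sums, key=squared_distance)


def calibrate(labeled_context_samples):
    """Calibration phase: fit a per-student context-performance model
    from a small set of (context_features, label) pairs."""
    model = NearestCentroid()
    for features, label in labeled_context_samples:
        model.update(features, label)
    return model


def personalize_online(appearance_model, context_model, stream):
    """Online phase: the context-performance model produces a label for each
    unlabeled sample, and that label is used to adapt the appearance model.
    The appearance model is modified in place and also returned."""
    for context_features, appearance_features in stream:
        pseudo_label = context_model.predict(context_features)
        appearance_model.update(appearance_features, pseudo_label)
    return appearance_model
```

As a usage illustration under the same assumptions: if a generic appearance model has a 'Satisfied' centroid at 0.0 and a 'Bored' centroid at 10.0, a student whose bored expression sits at 4.0 is initially classified as 'Satisfied'. After a few online samples that the calibrated context-performance model labels as 'Bored', the 'Bored' centroid moves towards 4.0 and the appearance model adapts to that student without any additional manual labels.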