The human pose estimation experiments are based on DEKR. We train DEKR on COCO, CrowdPose, and SynPose separately, and then test each trained model on a real dataset. The real dataset is drawn from classroom surveillance videos at Shanghai Jiao Tong University and covers classrooms of different sizes, which we divide into large, medium, and small. Each classroom has three surveillance cameras, two filming from the front and one from the back.

For a fair comparison, we generate two subsets, SynPose14 and SynPose17, which have the same data size and number of joints as the training sets of CrowdPose and COCO, respectively. We use HRNet-W32 as the backbone and set the maximum number of training epochs to 100. The following images visualize the model's predictions in real classroom scenes after training on COCO, CrowdPose, and the two SynPose subsets after CTGAN, respectively.
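As an illustration of how a size-matched, joint-matched subset could be built, the following is a minimal sketch and not the procedure actually used for SynPose14 and SynPose17. It assumes COCO-style annotations (a dict with `images` and `annotations`, keypoints stored as flat `x, y, v` triplets), assumes that "data size" means the number of images, and uses hypothetical names (`select_joints`, `make_subset`, `joint_indices`) throughout.

```python
import random


def select_joints(keypoints, joint_indices):
    """Keep only the (x, y, v) triplets of the requested joints.

    `keypoints` is a flat COCO-style list [x0, y0, v0, x1, y1, v1, ...].
    """
    kept = []
    for j in joint_indices:
        kept.extend(keypoints[3 * j: 3 * j + 3])
    return kept


def make_subset(dataset, target_images, joint_indices, seed=0):
    """Sample `target_images` images and restrict every pose to `joint_indices`.

    Hypothetical helper: the real subset construction may differ.
    """
    image_ids = sorted(img["id"] for img in dataset["images"])
    if target_images > len(image_ids):
        raise ValueError("target_images exceeds the number of available images")

    # A seeded generator keeps the sampled subset reproducible.
    rng = random.Random(seed)
    chosen = set(rng.sample(image_ids, target_images))

    images = [img for img in dataset["images"] if img["id"] in chosen]
    annotations = []
    for ann in dataset["annotations"]:
        if ann["image_id"] not in chosen:
            continue
        kpts = select_joints(ann["keypoints"], joint_indices)
        new_ann = dict(ann)
        new_ann["keypoints"] = kpts
        # Count joints that are labeled (visibility flag greater than 0).
        new_ann["num_keypoints"] = sum(1 for v in kpts[2::3] if v > 0)
        annotations.append(new_ann)

    return {"images": images, "annotations": annotations}
```

Under these assumptions, a 14-joint subset would be obtained by passing the 14 joint indices that correspond to the CrowdPose skeleton, with `target_images` set to the size of the CrowdPose training set. The mapping between SynPose joints and each target skeleton is not specified here and would have to be supplied by the user.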

The visualizations show that the model trained with our dataset can extract the poses of students sitting in the corners and performs better in large classrooms where students are densely distributed. In addition, the model trained with SynPose performs well on images taken from behind, owing to the rich diversity of viewpoints in the dataset.

Limitations.
The model trained with SynPose improves substantially over the models trained with COCO and CrowdPose in most scenarios. However, due to the bias in our dataset, it occasionally estimates the poses of standing students inaccurately. In addition, the model trained with SynPose shows limited improvement in extremely low-resolution or dimly lit classroom scenes. To address these issues, an upgraded dataset with more poses and interactions and a wider range of environmental factors and resolutions is under construction and will be released soon.