In this section, we use our method for determining the scene geometry (*cf*. Section “Method”) to demonstrate that current state-of-the-art monocular 3D human pose estimation methods do not accurately account for the underlying scene geometry, resulting in significant discrepancies between the estimated 3D poses and the actual 3D poses. We demonstrate this discrepancy by swapping the implied scene geometry for the approximately correct geometry obtained with our method and recording the resulting differences in the projected image (*cf*. Fig. 1).

In this experiment, we determine one fixed 2D skeleton for each of the evaluated frames/athletes, using a 2D HPE method based on a ResNet-50 backbone trained on MPII^{12,21}. We lift each 2D pose into 3D using different monocular 3D HPE methods (see below). We place the resulting 3D skeletons into a simulated scene and project this scene back into the 2D image using the actual scene geometry, as determined with our method. We can then compare this reprojection to the original 2D skeleton.

In terms of Fig. 1: We take a 2D skeleton (c) and lift it into a 3D skeleton (d). This monocular 3D HPE process implies some unknown geometry (f). We use our method to determine the actual geometry of the scene (e) and project the 3D skeleton (d) back into the 2D image (c) using this correct geometry. We show that the resulting reprojection differs from the original 2D skeleton, which in turn must mean that the implied geometry (f) differs from the actual geometry (e) and that the estimated 3D pose (d) differs from the actual 3D pose (a).
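The reprojection step can be illustrated with a minimal sketch, assuming an ideal pinhole camera without lens distortion. The function name `project_pinhole`, the joint position, and both focal lengths are hypothetical values chosen for illustration only. The sketch shows that the same 3D joint lands on different pixels once the camera geometry changes, which is the effect measured in this section.

```python
def project_pinhole(point, cam_pos, rot, focal_px, principal_point):
    """Project a 3D world point to pixel coordinates with an ideal pinhole camera.

    rot is a 3x3 world-to-camera rotation whose rows are the camera's
    x, y and z (optical) axes expressed in world coordinates.
    """
    d = [point[i] - cam_pos[i] for i in range(3)]
    xc, yc, zc = (sum(rot[r][i] * d[i] for i in range(3)) for r in range(3))
    if zc <= 0:
        raise ValueError("point lies behind the camera")
    return (principal_point[0] + focal_px * xc / zc,
            principal_point[1] + focal_px * yc / zc)


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
joint = (1.0, 0.5, 20.0)  # hypothetical joint, 20 m in front of the camera

# Same 3D joint, two hypothetical cameras that differ only in focal length.
implied = project_pinhole(joint, (0, 0, 0), identity, 1000.0, (960, 540))
actual = project_pinhole(joint, (0, 0, 0), identity, 1500.0, (960, 540))

# Pixel distance between the two projections of the same joint.
offset_px = ((implied[0] - actual[0]) ** 2 + (implied[1] - actual[1]) ** 2) ** 0.5
```

A non-zero `offset_px` for an unchanged 3D joint indicates that the two cameras describe different scene geometries.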

We quantify the reprojection error and estimate the underlying 3D error that caused it. We additionally provide evidence from a small real-world experiment that our method closely approximates the correct scene geometry and that our approximation of the implied 3D knee angle errors is reasonable.

We also provide additional adversarial experiments in the supplementary material that demonstrate that the results below are not just artifacts of the limited pinhole camera model. Lastly, we replicate the typical scene setup at our local track and compare our method to laser-verified ground truth measurements in Section “Ground truth evaluation”.

### Annotations

We annotate frames of five video sequences from different venues, athletes, and distances from major broadcast athletic events (*e.g*. Olympic Games, Diamond League, …). For these, we manually go through all the frames to ensure that our calculated results are consistent with all visible clues in the scene. Using lane demarcations, our algorithm automatically generates an exhaustive set of candidate camera parameters. We can then determine the correct camera parameters for every single frame using an annotation tool that lets the annotator slide through the various plausible camera parameters until they perfectly align with all additional visual clues. Using the resulting scene geometry of each frame, we ray-trace the exact 3D location of the athlete whenever they touch the ground. We determine the frames that depict the touch-down phase of the athletes’ stride by analyzing the foot progression of the 2D human pose estimate. We scale the 3D skeleton and additionally scale and translate the 2D projection to minimize the distance to the original 2D pose. We only use athletes that are fully visible to avoid errors due to occlusions from other athletes. This process results in a total of 355 frames, which we evaluate in the following.
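The ray-tracing of the ground contact point can be sketched as a ray-plane intersection. This is a minimal illustration under stated assumptions: an ideal pinhole camera, a flat track in the world plane z = 0, and a hypothetical helper name `pixel_to_ground`. It is not necessarily the exact procedure used for the annotations.

```python
def pixel_to_ground(pixel, cam_pos, rot, focal_px, principal_point):
    """Intersect the viewing ray through `pixel` with the ground plane z = 0.

    rot is a 3x3 world-to-camera rotation whose rows are the camera's
    x, y and z (optical) axes expressed in world coordinates.
    """
    # Ray direction in camera coordinates for an ideal pinhole camera.
    ray_cam = ((pixel[0] - principal_point[0]) / focal_px,
               (pixel[1] - principal_point[1]) / focal_px,
               1.0)
    # Camera-to-world rotation is the transpose of rot.
    ray_world = [sum(rot[r][i] * ray_cam[r] for r in range(3)) for i in range(3)]
    if abs(ray_world[2]) < 1e-12:
        raise ValueError("ray is parallel to the ground plane")
    t = -cam_pos[2] / ray_world[2]
    if t <= 0:
        raise ValueError("ground plane is behind the camera along this ray")
    return tuple(cam_pos[i] + t * ray_world[i] for i in range(3))


# Hypothetical camera 10 m above the track, looking straight down.
looking_down = [[1, 0, 0], [0, -1, 0], [0, 0, -1]]
foot = pixel_to_ground((1060, 540), (0.0, 0.0, 10.0), looking_down,
                       1000.0, (960, 540))
```

In this toy setup, a foot pixel 100 px right of the image centre maps to a point 1 m along the world x-axis on the ground.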

### Evaluated monocular 3D HPE methods

We compare three state-of-the-art methods for monocular 3D HPE: Strided Transformers^{18}, RIE^{17} and MeTRAbs^{21}. While the former two methods are trained solely on Human3.6m^{16}, MeTRAbs is additionally trained on external data and purpose-built for 3D HPE in the wild. For all these algorithms, we run 3D HPE, then detect the absolute position of an athlete’s foot in the scene, and place the 3D skeleton at that location. The predicted orientation and scale of the 3D skeletons depend on the 2D/3D correspondences in the training data. As the 3D skeletons do not necessarily comply with the actual geometry of the scene and the orientation of the camera, we adjust the scale to match the height of the projection. We also align the orientation of the predicted 3D skeleton with the axes of the constructed scene (*cf*. Fig. 2).

A preliminary analysis showed that MeTRAbs has superior performance over the other two methods. We furthermore compare MeTRAbs to slightly improved versions of itself. We inject information into the base algorithm that is ordinarily not available to it. The purpose of these modifications is to show that there is still an offset between the projection of the 3D skeleton and the actual 2D HPE in the original image, even when we improve the method by leveraging additional domain and scene knowledge.

*MeTRAbs + movement knowledge*. We exclusively investigate running footage in which the athletes run down the home stretch. We therefore know that the 3D skeletons in the scene should always face in the same direction and move in a straight line. The pan motion of the camera following the athlete changes the relative orientation of the athlete to the camera. This often causes the 3D lifting portion of monocular 3D HPE to describe a curved trajectory. Straightening out the athlete’s path leads to a first improvement, leveraging domain knowledge about the scene.
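One possible way to straighten a trajectory is sketched below. The text does not specify how the straightening is implemented, so this is an assumption: the hypothetical function `straighten_path` fits a straight line to the athlete's ground positions by total least squares (the principal axis of the point cloud) and projects every position onto it.

```python
import math


def straighten_path(positions):
    """Project 2D ground positions onto their best-fit straight line."""
    n = len(positions)
    mx = sum(p[0] for p in positions) / n
    my = sum(p[1] for p in positions) / n
    sxx = sum((p[0] - mx) ** 2 for p in positions)
    syy = sum((p[1] - my) ** 2 for p in positions)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in positions)
    # Orientation of the principal (major) axis of the 2x2 scatter matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    dx, dy = math.cos(theta), math.sin(theta)
    straightened = []
    for x, y in positions:
        t = (x - mx) * dx + (y - my) * dy  # coordinate along the fitted line
        straightened.append((mx + t * dx, my + t * dy))
    return straightened


# Hypothetical, slightly curved trajectory (metres on the ground plane).
curved = [(0.0, 0.0), (10.0, 0.4), (20.0, 0.6), (30.0, 0.4), (40.0, 0.0)]
straight = straighten_path(curved)
```

After this step, all positions in `straight` lie on one line, while their order and spacing along the running direction are preserved.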

*MeTRAbs + rotation knowledge*. Secondly, we compare the base algorithm to an improvement strategy in which we ideally rotate the 3D skeleton using the relative orientation of the camera to the skeleton. We use the same rationale as before: The athletes should always be facing the same direction. Only this time, we directly place and rotate the athlete such that they are facing the finish line. We can perform this rotation of the skeleton because our method tells us where the camera is located relative to the athlete, and we therefore again inject domain knowledge.

Both of these improvements leverage information that is not available to the base algorithm.

### Evaluation metrics

Ideally, for a perfect 3D HPE algorithm, placing the 3D skeleton in the correctly derived global geometry of the scene and then projecting it into the image using the derived camera parameters should result in a perfect overlap of the 2D skeleton and the reprojected 3D skeleton. Realistically, there will always be some margin of error. In the following, we measure this error for existing state-of-the-art 3D HPE methods. We further investigate the expected error over a sample of 16 athletes in videos recorded from different camera angles with panning and zooming, resulting in 355 data points.

We do not have ground truth 3D HPE data for the investigated videos, so we cannot perform the typical analysis of 3D MPJPE (Mean Per Joint Position Error). Instead, the reprojection error as described above is expressed in the 2D image space. Additionally, for each of the studied athletes, we simulate a movement in their knee in 3D space and record the resulting changes in the projected image. Below is a detailed description of our evaluation metrics; the results can be found in Table 1. We consider 17 major joint locations, commonly used across HPE benchmarks: head, neck, chest, navel, pelvis, 2\(\times \) shoulder, 2\(\times \) elbow, 2\(\times \) wrist, 2\(\times \) hip, 2\(\times \) knee and 2\(\times \) ankle.

*Reprojection error*. Using the uncovered 3D geometry of the scene, we project the 3D skeleton into the image and compute the average per-joint offset to the corresponding 2D skeleton in pixels. For this, we use the 17 default Human3.6m joint definitions^{16}. Additionally, since the exact scene geometry is known (track lanes are \(1.22\pm 0.01\) m wide), we scale this value by the ratio of the athlete’s real-world height to their height in pixels. This is not the correct joint distance in mm, but just an approximation incorporating the image scale. For a true distance measure, we would require ground truth 3D skeleton information. We include this measure as it more accurately accounts for the distance of the athlete to the camera and the camera’s zoom.
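The metric can be written down in a few lines. The following is a minimal sketch with hypothetical function names and toy numbers; it assumes that both skeletons list their joints in the same order and that the athlete's real-world and pixel heights are given.

```python
import math


def reprojection_error_px(projected, detected):
    """Mean per-joint Euclidean distance between two 2D skeletons, in pixels."""
    if len(projected) != len(detected):
        raise ValueError("skeletons must have the same number of joints")
    return sum(math.dist(p, q) for p, q in zip(projected, detected)) / len(projected)


def px_to_approx_mm(error_px, athlete_height_mm, athlete_height_px):
    """Scale a pixel error by the athlete's real-world versus pixel height."""
    return error_px * athlete_height_mm / athlete_height_px


# Toy example with two joints instead of 17 (values are illustrative only).
projected = [(100.0, 200.0), (103.0, 254.0)]
detected = [(100.0, 200.0), (100.0, 250.0)]
err_px = reprojection_error_px(projected, detected)  # (0 + 5) / 2 pixels
err_mm = px_to_approx_mm(err_px, athlete_height_mm=1800.0, athlete_height_px=300.0)
```

Because the scale factor depends only on the athlete's apparent size, the millimetre value remains an approximation rather than a true 3D distance.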

*2D knee error*. For kinematic investigations, we are not particularly interested in the absolute position of each of the joints, but rather their relation to each other. As motivated in Section “Related work”, we want to investigate the knee angles of the athlete. Since we do not know the correct 3D skeletons or detailed running kinematics in our test data, we measure the knee angle error for the 2D poses.
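A knee angle can be computed from the hip, knee and ankle positions as the angle between the thigh and shank vectors. The sketch below uses hypothetical function names and toy coordinates; the 2D knee error is then taken as the absolute difference between the angle in the reprojected skeleton and the angle in the detected 2D skeleton.

```python
import math


def joint_angle_deg(a, b, c):
    """Angle at joint b (in degrees) formed by the segments b->a and b->c."""
    v1 = [ai - bi for ai, bi in zip(a, b)]
    v2 = [ci - bi for ci, bi in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        raise ValueError("joints must not coincide")
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def knee_error_2d(reprojected, detected):
    """Absolute knee angle difference; each argument is (hip, knee, ankle)."""
    return abs(joint_angle_deg(*reprojected) - joint_angle_deg(*detected))


# Toy pixel coordinates: a straight leg versus a leg bent at a right angle.
straight_leg = ((0.0, 0.0), (0.0, 50.0), (0.0, 100.0))
bent_leg = ((0.0, 0.0), (0.0, 50.0), (50.0, 50.0))
error_deg = knee_error_2d(straight_leg, bent_leg)
```

The same `joint_angle_deg` helper works for 3D coordinates, since it only relies on dot products and vector norms.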

*Approx. 3D knee error*. If we had an orthogonal view of the athlete’s knee, the visible 2D knee angle (and its error) would roughly correspond to the 3D knee angle. For increasingly steep angles of the camera towards the sagittal plane of the athlete, though, this correspondence breaks down. For larger values of the camera’s azimuth, 2D knee errors result in more severe actual knee errors. We approximate the 3D knee error by simulating movement in the predicted 3D skeleton’s knee and recording its effect on the 2D knee angle error. The measured 2D knee error is then scaled accordingly for each of the evaluated frames.
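The scaling idea can be sketched as follows. This is an illustrative assumption about how such a simulation might look, not a description of the exact implementation: the shank is rotated about the knee's flexion axis by a small known 3D angle, both poses are projected, and the ratio of 3D change to observed 2D change is used as a scale factor. All names, the orthographic toy camera, and the pose are hypothetical.

```python
import math


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def angle_deg(a, b, c):
    """Angle at b between the segments b->a and b->c, in degrees."""
    v1, v2 = sub(a, b), sub(c, b)
    cos_t = dot(v1, v2) / math.sqrt(dot(v1, v1) * dot(v2, v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def rotate_about_axis(v, axis, deg):
    """Rodrigues' rotation of vector v about `axis` by `deg` degrees."""
    n = math.sqrt(dot(axis, axis))
    k = [x / n for x in axis]
    t = math.radians(deg)
    kxv, kv = cross(k, v), dot(k, v)
    return [v[i] * math.cos(t) + kxv[i] * math.sin(t) + k[i] * kv * (1 - math.cos(t))
            for i in range(3)]


def knee_scale_factor(hip, knee, ankle, project, delta_deg=1.0):
    """Ratio of a simulated 3D knee angle change to its visible 2D change."""
    thigh, shank = sub(hip, knee), sub(ankle, knee)
    axis = cross(thigh, shank)  # flexion axis, normal to the leg plane
    if dot(axis, axis) < 1e-12:
        raise ValueError("fully extended leg: flexion axis is undefined")
    moved = [k + m for k, m in zip(knee, rotate_about_axis(shank, axis, delta_deg))]
    before = angle_deg(project(hip), project(knee), project(ankle))
    after = angle_deg(project(hip), project(knee), project(moved))
    change_2d = abs(after - before)
    if change_2d < 1e-9:
        raise ValueError("knee movement is not visible in this view")
    return delta_deg / change_2d


def orthographic_view(azimuth_deg):
    """Toy camera: x is the running direction, z is up, azimuth 0 is a side view."""
    a = math.radians(azimuth_deg)
    return lambda p: (p[0] * math.cos(a) - p[1] * math.sin(a), p[2])


# Hypothetical leg with 20 degrees of knee flexion in the sagittal (x-z) plane.
hip, knee = (0.0, 0.0, 1.0), (0.0, 0.0, 0.5)
ankle = (-0.5 * math.sin(math.radians(20)), 0.0, 0.5 - 0.5 * math.cos(math.radians(20)))

side_factor = knee_scale_factor(hip, knee, ankle, orthographic_view(0))
oblique_factor = knee_scale_factor(hip, knee, ankle, orthographic_view(60))
approx_3d_error = 5.0 * oblique_factor  # scaling a hypothetical 5 degree 2D error
```

In this toy pose, the side view yields a factor of about one, whereas the oblique view yields a factor above one, matching the observation that larger azimuths make a given 2D error more severe in 3D. The factor depends on the pose as well as on the azimuth, which is why it is evaluated per frame.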

The best approximated 3D knee error in our comparison is 8.45\(^{\circ }\) with a standard deviation of 13.19\(^{\circ }\) (*cf*. Table 1). This margin of error is larger than the changes associated with significant differences in running kinematics and implied running economy as detailed in the literature^{5,7,9,31} (*cf*. Section “Related work”), rendering the current state-of-the-art methods unsuitable for collecting data for kinematic investigations.

### Ground truth evaluation

We perform a small validation study using an Xsens motion capture suit (MVN Link, Xsens Technologies B.V., Enschede, Netherlands, https://www.xsens.com/products/mvn-analyze). This IMU-based motion capture system has been independently validated with angle errors of \(< 2.6 \pm 1.5^{\circ }\)^{44}. We set up multiple camera locations in the stands close to the finish line to match typical broadcast images, and triangulate the cameras’ positions using both a laser range finder and optical methods. The image used in Figs. 1, 2 and 3 is a still from our own video recordings, which is representative of the positioning and settings of broadcast video. In our experiment, one athlete runs on the home stretch of the track and we simultaneously record 3D motion capture and video footage, performing the TV-typical camera operations: pan, tilt, and zoom (up to 30x). We use the method described in Section “Method” to extract possible camera parameters. For 50 frames, we manually pick the camera parameters that best align the projected 3D skeleton with the athlete’s image. In terms of Fig. 1, we recorded the actual 3D pose (Fig. 1a), annotated the scene geometry (Fig. 1e), and filmed the skeleton (Fig. 1c). We now compare these to the estimated pose (Fig. 1d).

First, we evaluate how well our method predicts the actual geometry and location of the camera in the scene. Our model is based on a pinhole camera, whereas in reality, we filmed with a conventional camera with multiple lenses. We therefore cannot expect our method to find the exact position of the real camera, but only that of a virtual camera. We find that the predicted camera location is within 5.5% of the correct camera position (w.r.t. the distance of the camera to the athlete). The average offset of the predicted camera to the actual camera in x/y/z direction is 1.75 m / 2.67 m / 0.72 m (min: 0.09 m / 0.18 m / 0.01 m, max: 4.48 m / 9.50 m / 2.05 m). The athlete is at an average distance of 37.56 m to the camera (min: 14.57 m, max: 71.41 m).

We next compare the 2D HPE for these 50 frames to the projection of the recorded 3D Xsens skeleton using the correct scene geometry, resulting in an RMSE of \(7.56 \pm 3.75\) pixels, which equals \(50.42 \pm 28.51\) mm.

Finally, we evaluate the 3D angle error in the knees and elbows between the recorded and estimated skeletons (*cf*. Fig. 1a,d). The left/right knees have an average error of \(8.39 \pm 4.41^{\circ }\) / \(7.94\pm 5.84^{\circ }\), which is in line with our approximated 3D error in Table 1. The left/right elbows have an average error of \(15.81\pm 7.80^\circ \) / \(11.85 \pm 5.65^\circ \), yielding an overall expected error in 3D angle prediction of \(11.00 \pm 5.93^\circ \).
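As a plausibility check, the overall figure is consistent with an unweighted average over the four joints. That the overall value was obtained this way is an assumption based on the arithmetic, not a statement from the evaluation itself:

\[
\frac{8.39^{\circ } + 7.94^{\circ } + 15.81^{\circ } + 11.85^{\circ }}{4} = \frac{43.99^{\circ }}{4} \approx 11.00^{\circ },
\qquad
\frac{4.41^{\circ } + 5.84^{\circ } + 7.80^{\circ } + 5.65^{\circ }}{4} = \frac{23.70^{\circ }}{4} \approx 5.93^{\circ }.
\]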