Based on what is described in Kinect for XBox 360 - The innovation Journey, a basic understanding of the approach could be:
1- Segment the human shape from the 3D depth data (this can be done fairly easily with a heuristic, such as taking the pixels closest to the camera and running some blob detection on them). Let's just assume we can start with this human silhouette already segmented.
2- For each pixel, run a classifier that tells which of the 32 body parts it most probably belongs to. See slide 57 of Kinect for XBox 360 - The innovation Journey. This is accomplished with an idea similar to that of 'TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-Class Object Recognition and Segmentation'.
3- To get training data for that classifier, one easy way would be to generate it synthetically using Unity3D with some mocap data (more on this later), or maybe Blender or other freely available modeling software.
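4- Let's say we've achieved a result similar to the previously mentioned slide 57, i.e. every silhouette pixel is labeled with a body part. From that, we can average the pixel positions and depths of each of the 32 parts to get one 3D point per part. These points are essentially the skeleton we are looking for.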
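
As a rough illustration of steps 1 and 4, here is a minimal NumPy sketch. It is only a sketch under stated assumptions, not how Kinect actually does it: the function names `segment_nearest` and `part_centers`, the `band` threshold, and the convention that a depth of 0 means "no reading" are all made up for this example. The blob detection part of step 1 is left out (the person is assumed to be the nearest object), and the per-pixel labels from step 2 are assumed to be given as an integer array.

```python
import numpy as np


def segment_nearest(depth, band=0.5):
    """Step 1 heuristic: keep the pixels closest to the camera.

    `depth` is a 2D array of distances (assumed metres, 0 = no reading).
    `band` is an assumed tuning value: how far behind the nearest valid
    pixel a pixel may be and still count as part of the silhouette.
    """
    valid = depth > 0
    if not valid.any():
        return np.zeros(depth.shape, dtype=bool)
    nearest = depth[valid].min()
    return valid & (depth <= nearest + band)


def part_centers(depth, labels, mask):
    """Step 4: average (row, col, depth) over the pixels of each part.

    `labels` holds one body part id per pixel (the classifier output of
    step 2, assumed given). Only pixels inside `mask` are used.
    Returns {part_id: (mean_row, mean_col, mean_depth)}.
    """
    rows, cols = np.nonzero(mask)
    part_ids = labels[rows, cols]
    depths = depth[rows, cols]
    centers = {}
    for part in np.unique(part_ids):
        sel = part_ids == part
        centers[int(part)] = (
            float(rows[sel].mean()),
            float(cols[sel].mean()),
            float(depths[sel].mean()),
        )
    return centers


# Tiny synthetic example: a 2x2 "person" at 1 m in front of a wall at 3 m.
depth = np.full((4, 4), 3.0)
depth[1:3, 1:3] = 1.0
labels = np.zeros((4, 4), dtype=int)
labels[2, 1:3] = 1  # pretend the lower row belongs to another body part

mask = segment_nearest(depth)
centers = part_centers(depth, labels, mask)
```

With real data, the (row, col, depth) points would still need to be converted to camera-space coordinates using the sensor intrinsics, and a plain mean is sensitive to mislabeled pixels, so something more robust than averaging may be needed in practice.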