Protocol Documentation
OpenKinect Collaboration
Please check back often; as additional insight and knowledge are gained, they will be documented here.
Links
- https://github.com/adafruit/Kinect/blob/master/usbmotor.py - Python motor control code - does the "move up and down" action.
- https://gist.github.com/670533 - Sample python code
- https://github.com/adafruit/Kinect/tree/master/USBlogs/ - Some USB logs, courtesy of adafruit, hosted on the wonderful large binary file distribution site, GitHub, for easy downloading.
USB Communication
USB communication is currently under development, with most communication taking place using pyusb and libusb.
Devices
- 04 - NUI Motor
- 07 E1 - NUI Camera - RGB
- 07 E2 - NUI Camera - Depth
- 06 - NUI Audio
NB Do not assume that Motor and Audio devices will always be present, as it is possible to run the camera board standalone.
Control Packet Structure
USB control messages are used to read accelerometer values, set Motor/LED status and camera registers. More information on the basic structure of these packets at http://www.beyondlogic.org/usbnutshell/usb6.shtml#SetupPacket
Control Transfer (8-byte) Request:
- RequestType (1 byte)
- Request (1 byte)
- Value (2 bytes)
- Index (2 bytes)
- Length (2 bytes)
The common values used for the RequestType field when talking to a Kinect are:
- 0x80 (LIBUSB_REQUEST_TYPE_STANDARD | LIBUSB_RECIPIENT_DEVICE | LIBUSB_ENDPOINT_IN)
- 0x40 (LIBUSB_REQUEST_TYPE_VENDOR | LIBUSB_RECIPIENT_DEVICE | LIBUSB_ENDPOINT_OUT)
- 0xc0 (LIBUSB_REQUEST_TYPE_VENDOR | LIBUSB_RECIPIENT_DEVICE | LIBUSB_ENDPOINT_IN)
For read packets (RequestType 0x80 and 0xc0) Length is the length of the response.
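As a rough illustration of how the three RequestType values are composed and how the 8-byte setup packet is laid out, here is a minimal Python sketch. The constant names and the `setup_packet` helper are illustrative only (they mirror libusb's naming but are not part of any library); the bit positions follow the standard USB setup packet layout.

```python
import struct

# Bit fields of the RequestType byte (standard USB setup packet layout).
# These names are illustrative; they only mirror libusb's naming.
ENDPOINT_OUT = 0x00           # bit 7 clear: host-to-device
ENDPOINT_IN = 0x80            # bit 7 set: device-to-host
REQUEST_TYPE_STANDARD = 0x00  # bits 6..5 = 0
REQUEST_TYPE_VENDOR = 0x40    # bits 6..5 = 2
RECIPIENT_DEVICE = 0x00       # bits 4..0 = 0

STANDARD_IN = REQUEST_TYPE_STANDARD | RECIPIENT_DEVICE | ENDPOINT_IN
VENDOR_OUT = REQUEST_TYPE_VENDOR | RECIPIENT_DEVICE | ENDPOINT_OUT
VENDOR_IN = REQUEST_TYPE_VENDOR | RECIPIENT_DEVICE | ENDPOINT_IN


def setup_packet(request_type, request, value, index, length):
    """Pack the 8-byte setup packet; multi-byte fields are little-endian."""
    return struct.pack("<BBHHH", request_type, request, value, index, length)
```

In practice pyusb and libusb build this packet for you; the helper only makes the field layout explicit.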
Motor Initialization
Verify readiness? Send: 0xC0, 0x10, 0x0000, 0x0000, 0x0001 Response: 0x22
The joint range of the Kinect motor is +31 degrees (up) and -31 degrees (down).
Tilting the Camera Up and Down
The pitch (up/down angle) of the camera can be set by sending:
type | request | value | index | data | length |
---|---|---|---|---|---|
0x40 | 0x31 | 2*desired_angle_degrees | 0x0 | empty | 0 |
The desired angle is relative to the horizon, not relative to the base. The Kinect uses its accelerometers to detect when it has reached the correct angle, and then stops. So if the Kinect is on a hill, and you set the desired angle to 0 degrees, the camera will become level, not parallel to the hill.
The value is actually in the range -128 to +128, where -128 is a bit less (more positive) than -90 degrees, and +128 is a bit less than +90 degrees. So, it is not exactly 2 x the desired angle. The mapping is a bit rough, presumably because the accelerometers aren't calibrated and accelerometers never perform identically. The value corresponds to the 9th byte of the 10-byte accelerometer report, which reports the current angle from -128 to +128.
Warning: Sending the motor past 31 degrees relative to the base has been shown to stall the motor. There is no way of knowing the angle relative to the base, since angles are measured relative to the horizon, and there are no automatic safeguards to prevent this kind of command from being sent. Manual safeguards that restrict the range of values sent are impossible too, since we don't know the angle of the base. To detect a stall, it is necessary to monitor the 10-byte accelerometer report. The last byte of the accelerometer report is the motor status:
- 0: stationary and OK
- 1: stopped because the motor couldn't go any further
- 4: moving like normal
- 8: taking a quick break and about to try again because the motor couldn't go further
The second byte (out of 10) tells you the current level of strain on the motor, which applications can monitor; 0 means no strain. Due to the low torque and power consumption no damage is likely, but stalling should be avoided to ensure maximum component life.
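The tilt command can be sketched as a small helper that returns the control-transfer arguments. This is only a sketch under assumptions: `tilt_request` is a hypothetical name, the clamp to 31 degrees is relative to the horizon (so it does not protect a tilted base), and encoding negative values as 16-bit two's complement in the Value field is an assumption, not something stated above.

```python
MAX_TILT_DEGREES = 31  # joint range quoted above (relative to the horizon here)


def tilt_request(angle_degrees):
    """Return (type, request, value, index, data) for a tilt command.

    Assumption: negative angles are sent as 16-bit two's complement.
    """
    clamped = max(-MAX_TILT_DEGREES, min(MAX_TILT_DEGREES, angle_degrees))
    value = int(round(2 * clamped)) & 0xFFFF
    return (0x40, 0x31, value, 0x0, b"")
```

With pyusb, the tuple could be passed along the lines of `dev.ctrl_transfer(*tilt_request(10))`, where `dev` is an already-opened motor device.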
Setting LED
The LED can be set by sending:
type | request | value | index | data | length |
---|---|---|---|---|---|
0x40 | 0x06 | led_option | 0x0 | empty | 0 |
where the led_options are enumerated as follows:
- LED_OFF = 0
- LED_GREEN = 1
- LED_RED = 2
- LED_YELLOW = 3 (actually orange)
- LED_BLINK_YELLOW = 4 (actually orange)
- LED_BLINK_GREEN = 5
- LED_BLINK_RED_YELLOW = 6 (actually red/orange)
Other colours are possible by rapidly swapping the value to 2 different colours hundreds of times per second.
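A minimal Python sketch of the LED options and the matching control-transfer arguments follows. The `Led` enum and `led_request` helper are hypothetical names introduced for illustration; only the numeric values come from the list above.

```python
from enum import IntEnum


class Led(IntEnum):
    OFF = 0
    GREEN = 1
    RED = 2
    YELLOW = 3            # actually orange
    BLINK_YELLOW = 4      # actually orange
    BLINK_GREEN = 5
    BLINK_RED_YELLOW = 6  # actually red/orange


def led_request(option):
    """Return (type, request, value, index, data); rejects unknown options."""
    return (0x40, 0x06, int(Led(option)), 0x0, b"")
```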
Reading Joint State
The joint state information is grouped in with the accelerometer data and is stored in bytes 8 and 9 (counting from zero) of the return from:
type | request | value | index | data | length |
---|---|---|---|---|---|
0xC0 | 0x32 | 0x0 | 0x0 | buf | 10 |
Byte 8 (buf[8]) yields:
positive_angle_degrees = value/2
negative_angle_degrees = (255-value)/2
buf[8] = 0x80 if the Kinect is moving (buf[9] is usually 0x04, but sometimes 0x00).
Please note that this is not the angle of the motor; it is the angle of the Kinect itself in degrees (basically accelerometer data translated).
Byte 9 (buf[9]) yields the following status codes:
- 0x0 - stopped
- 0x1 - reached limits
- 0x4 - moving
Reading Accelerometer
The accelerometer data is stored in two-byte pairs for x, y, and z:
ux = ((uint16_t)buf[2] << 8) | buf[3];
uy = ((uint16_t)buf[4] << 8) | buf[5];
uz = ((uint16_t)buf[6] << 8) | buf[7];
The Accelerometer documentation (http://www.kionix.com/Product%20Sheets/KXSD9%20Product%20Brief.pdf) states there are 819 counts/g
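Putting the joint state and accelerometer fields together, the 10-byte report could be decoded roughly as follows. This is a sketch, not a reference decoder: `decode_report` is a hypothetical helper, treating the axis values as signed 16-bit and buf[8] as a signed byte is an assumption (it differs from the (255-value)/2 formula by half a degree for negative angles), and 0x80 in buf[8] is reported as "no angle available".

```python
import struct

COUNTS_PER_G = 819  # from the accelerometer product brief cited above


def decode_report(buf):
    """Decode the 10-byte motor/accelerometer report into a dict."""
    if len(buf) != 10:
        raise ValueError("expected a 10-byte report")
    # Big-endian pairs at buf[2..7]; signedness is an assumption.
    x, y, z = struct.unpack_from(">hhh", buf, 2)
    (tilt_raw,) = struct.unpack_from("b", buf, 8)
    return {
        "strain": buf[1],
        "accel_g": (x / COUNTS_PER_G, y / COUNTS_PER_G, z / COUNTS_PER_G),
        # 0x80 (-128 as a signed byte) is seen while the Kinect is moving.
        "tilt_degrees": None if tilt_raw == -128 else tilt_raw / 2,
        "status": buf[9],
    }
```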
Cameras
The depth camera returns values with 11-bits of precision.
RGB data should follow similar conventions but will need to be analyzed. RGB frames are significantly bigger and encoded using a Bayer pattern.
Relevant Bits of Code
The most important piece of data gathered from the libkinect project is the following struct:
struct __attribute__ ((packed)) frame_info {
    uint8_t magic[2];    // "RB"
    uint8_t control;     // 00 - means incoming data, other values (if any) are control codes
    uint8_t cmd;         // if control=0, 71 or 81 - new frame, 72 or 82 - current frame, 75 or 85 - EOF (7x-depth, 8x-color)
    uint8_t SeqNum;
    uint8_t pkt_seq;
    uint8_t LengthHigh;  // Length of Packet (High Byte)
    uint8_t LengthLow;   // Length of Packet (Low Byte)
    uint32_t time_stamp; // all 0 on new frame. packets for one frame has the same timestamp
    uint8_t data[];
};
int _frame_pos;
uint8_t _frame[422400];
This is a 12-byte struct followed with a data array and forms the basis of the Kinect's depth/color camera perception protocol. As you can tell we definitely have some magic inside this structure. Some of them are most likely control/status commands and will be determined later on. The _frame_pos and _frame are private member variables which work to effectively process the data.
To process an image we do the following, after pushing the data into the above struct:
- If cmd == 0x71, a new frame is starting: set the _frame_pos index to 0.
- Get the length of the data array. This is the total length of the packet minus the length of the header information in the above struct.
- memcpy the data array into the _frame array, using _frame_pos as the base index, then advance _frame_pos by the data length.
- If cmd == 0x75, we have reached the end of the current frame; time to send it to an output of some sort.
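The steps above can be sketched in Python as a small reassembly class. `FrameAssembler` and its parameters are hypothetical names for illustration, and reading time_stamp as little-endian is an assumption; for the color camera, 0x81 and 0x85 would be passed as the start and end commands.

```python
import struct

# magic[2], control, cmd, SeqNum, pkt_seq, LengthHigh, LengthLow, time_stamp.
# The byte order of time_stamp is an assumption.
HEADER = struct.Struct("<2sBBBBBBI")


class FrameAssembler:
    def __init__(self, start_cmd=0x71, end_cmd=0x75, frame_size=422400):
        self.start_cmd = start_cmd
        self.end_cmd = end_cmd
        self.frame = bytearray(frame_size)
        self.pos = 0

    def feed(self, packet):
        """Add one packet; return the finished frame bytes or None."""
        if len(packet) < HEADER.size:
            return None
        magic, control, cmd = HEADER.unpack_from(packet)[:3]
        if magic != b"RB" or control != 0:
            return None
        if cmd == self.start_cmd:
            self.pos = 0  # new frame: restart at the beginning
        data = packet[HEADER.size:]
        end = self.pos + len(data)
        if end > len(self.frame):
            self.pos = 0  # more data than one frame holds: drop it
            return None
        self.frame[self.pos:end] = data
        self.pos = end
        if cmd == self.end_cmd:
            return bytes(self.frame[:self.pos])
        return None
```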
There are 242 packets per frame for the depth camera (including the 0x71 and 0x75 packets). All packets are 1760 bytes except the 0x75 packet, which is 1144 bytes. Minus headers, this gives 422400 bytes of data.
There are 162 packets per frame for the color camera (including the 0x81 and 0x85 packets). All packets are 1920 bytes except the 0x85 packet, which is 24 bytes. Minus headers, this gives 307200 bytes of data.
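The packet arithmetic can be checked with a few lines of Python. `payload_bytes` is a hypothetical helper; the only inputs are the packet counts and sizes quoted above and the 12-byte header.

```python
HEADER_SIZE = 12  # size of the frame_info header


def payload_bytes(packets, full_size, last_size):
    """Total data bytes in one frame once every header is removed."""
    return (packets - 1) * full_size + last_size - packets * HEADER_SIZE


depth_bytes = payload_bytes(242, 1760, 1144)  # 640*480 pixels at 11 bits each
color_bytes = payload_bytes(162, 1920, 24)    # 640*480 pixels at 8 bits each
```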
Possible improvements to this algorithm could include monitoring pkt_seq and time_stamp to check that data packets arrive in sequence. Obviously if the data is corrupt you have bigger problems, but it might be nice. Need to figure out the magic.
After we have the data, for basic display in the QWidget and to generate proper RGB pixels:
- Reverse the bit order of each byte of the data. (0b10000000 becomes 0b00000001)
- Shove it into an RGB bitstream; since we're only dealing with monochrome data, it's just duplicated three times for convenience.
As a note there's some bit manipulation going on to get it into a pixel value. Each pixel is 11 bits, which gives it 2048 possible values (0 to 2047). Kind of nifty. You can see the bit-shifting going on inside the libkinect source, documented here for posterity:
for (int p = 0, bit_offset = 0; p < 640*480; p++, bit_offset += 11) {
uint32_t pixel = 0; // value of pixel
pixel = *((uint32_t *)(data+(p*11/8)));
pixel >>= (p*11 % 8);
pixel &= 0x7ff;
uint8_t pix_low = (pixel & 0x00ff) >> 0;
uint8_t pix_high = (pixel & 0xff00) >> 8;
pix_low = reverse[pix_low];
pix_high = reverse[pix_high];
pixel = (pix_low << 8) | (pix_high);
pixel >>= 5;
// Image drops the 3 MSbs
rgb[3*p+0] = 255-pixel;
rgb[3*p+1] = 255-pixel;
rgb[3*p+2] = 255-pixel;
}
Looks fairly routine, but I'll be the first to admit bit-wise operations make my head hurt, doubly so for efficient bit-wise operations.
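Tracing the bit operations above suggests that, combined with the earlier per-byte bit reversal, the net effect is to read consecutive 11-bit values from the raw stream most-significant-bit first. Under that assumption, a plain (and slow) Python sketch of the unpacking looks like this; `unpack_11bit` is a hypothetical name, not a libkinect function.

```python
def unpack_11bit(data, count):
    """Unpack up to `count` 11-bit samples packed back-to-back, MSB first."""
    samples = []
    acc = 0    # bit accumulator
    nbits = 0  # number of unread bits currently held in acc
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        if nbits >= 11:
            nbits -= 11
            samples.append((acc >> nbits) & 0x7FF)
            acc &= (1 << nbits) - 1  # keep only the unread bits
            if len(samples) == count:
                break
    return samples
```

For a full depth frame this would be called as `unpack_11bit(frame, 640 * 480)` on the 422400 reassembled bytes.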
RGB Camera
The RGB Camera follows the same frame format as the depth camera, with a minor difference of cmd being 0x8x instead of 0x7x.
The frame output of the RGB camera is a 640x480 Bayer pattern, so the total frame size is 640*480=307200 bytes.
The Bayer pattern is laid out as: RG, GB.
Control EP
The control endpoint uses a different header:
struct control_hdr {
uint8_t magic[2]; //can be "RB" for Input and "GM" for Output (command)
uint16_t len; //data length in words
uint16_t cmd;
uint16_t cmd2_or_tag; //cmd and tag from GM packet should be matched in response RB packet
uint16_t data[];
};
The control endpoint most likely initializes the IR projector and cameras.
Control Commands
Here are some of the commands we've fuzzed or found in USB logs.
Setting Parameters
control_hdr.cmd = 0x0003;
control_hdr.tag = NONCE;
control_hdr.len = 0x0002;
uint16_t data[] = { ParameterID, value };
Reading Parameters
control_hdr.cmd = 0x0002;
control_hdr.tag = NONCE;
control_hdr.len = 0x0001;
uint16_t data[] = { ParameterID, 0x0000 };
Replies
The reply returns the original packet header with the magic bits set to RB. You must do a USB read after you've written the command to ask for the reply. The first uint16_t of the reply will be the status of the command. 0x0000 is success, 0x0005 is failure. On a read, it will return the currently set value as the second uint16_t. When reads fail with 0x0005, we are assuming the command does not exist.
Note that Windows error value 5 is ERROR_ACCESS_DENIED. Not sure if that's relevant.
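As a sketch of how a "GM" command and its "RB" reply might be packed and parsed with the control_hdr layout, consider the following. `build_command` and `parse_reply` are hypothetical helpers; little-endian field order and a reply len that counts the reply's data words are assumptions.

```python
import struct

HDR = struct.Struct("<2sHHH")  # magic, len (in words), cmd, tag


def build_command(cmd, tag, words):
    """Pack a "GM" command carrying a list of uint16 data words."""
    body = struct.pack("<%dH" % len(words), *words)
    return HDR.pack(b"GM", len(words), cmd, tag) + body


def parse_reply(reply):
    """Return (cmd, tag, words); words[0] is the status (0 = success)."""
    magic, length, cmd, tag = HDR.unpack_from(reply)
    if magic != b"RB":
        raise ValueError("not a reply packet")
    words = struct.unpack_from("<%dH" % length, reply, HDR.size)
    return cmd, tag, words


# Example: set parameter 0x0006 (Depth Stream Control) to 2, with tag 5.
open_depth = build_command(0x0003, 5, [0x0006, 0x0002])
```

The caller is expected to check that the cmd and tag of the reply match those of the command it sent.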
Parameter | Default | Valid Range | Behavior |
---|---|---|---|
0x0000 | 0x01 | | Replies 05 00 01 00 unless written value > 2, in which case the IR cam turns off and glview hangs. |
0x0005 | 0x00 | | Color Stream Control. 0: Disable stream; 1: Open RGB Stream; 2: ?; 3: Open IR Stream |
0x0006 | 0x00 | | Depth Stream Control. 0: Disable stream; 1: (also opens depth stream); 2: Open Depth Stream; 3: (also opens depth stream) |
0x000c | 0x00 | 0x0000-0x0005 | RGB Image Format. 0x0000 = Bayer; 0x0001 = Compressed UYVY, must be used with 15hz framerate (marcan's python decompressor); 0x0002 = ?; 0x0003 = ?; 0x0004 = ?; 0x0005 = UYVY, must be used in conjunction with 15hz framerate |
0x000d | 0x00 | | RGB Image Resolution. 0: small; 1: standard (640x480); 2: full (1280x1024) (must be used with "15" hz framerate) |
0x000e | 0x00 | 15, 30 | RGB Framerate. 0x1e (30): 30 fps; 0x0f (15): 15 fps (~10 when in 1280x1024 video mode) |
0x0011 | 0x01 | 0, 1 | Nothing visible happens, but the only accepted input values are 0 and 1 |
0x0012 | 0x01 | | Depth Stream Format. 0: uncompressed 16 bit depth stream between 0x0000 and 0x07ff (2^11-1), causes bandwidth issues and will drop packets; 1: differential/RLE compressed 11 bit depth stream; 2: 10-bit stream; 3: 11-bit stream |
0x0013 | 0x01 | 0x0000-0x0002 | Depth Stream Resolution. 0: small; 1: standard (640x480); 2: lots of data - haven't gotten a coherent frame out of it. |
0x0014 | 0x00 | ?, 30 | Depth Framerate. 0x1e (30): 30 fps |
0x0015 | 0x1e | | Send a 0, it replies 05 00 00 00 (not 05 00 01 00 as with most other commands). Values 1-50 return success, others 05 00 00 00. Changes the brightness of the IR image, with 50 = highest brightness and 1 = lowest brightness. |
0x0016 | 0x0001 | 0, 1 | Depth Smoothing (hole-filling). LSB = 0: Smoothing Disabled; LSB = 1: Smoothing Enabled (default) |
0x0017 | 0x0000 | 0, 1 | Depth H-Flip. 0: Regular feed; 1: Flipped Horizontally |
0x0019 | 0x0000 | 0x0000 | IR Stream Format (unconfirmed, inferred from RGB & depth commands). All values appear to give 10-bit packed luminance values. |
0x001a | 0x01 | 0x0000-0x0002 | IR Stream Resolution. 0: small; 1: standard (640x488); 2: full (1280x1024) |
0x001b | 0x00 | 15, 30 | IR Framerate. 0x0f (15): 15 fps (or ~9, when in high-res IR); 0x1e (30): 30 fps |
0x0024 | 0x01 | | Unknown, but has nonzero default value |
0x002d | 0x01 | | Unknown, but has nonzero default value |
0x0047 | 0x00 | | RGB H-Flip. 0: Regular feed; 1: Flipped Horizontally. Note that when flipped, the data keeps the same Bayer pattern, so the demosaicing looks less good. |
0x0048 | 0x00 | | IR H-Flip. 0: Regular feed; 1: Flipped Horizontally. Note that since the depth buffer is computed from the IR image, enabling IR hflip will render the depth image useless. This is probably not something that you want. |
0x0100 | 0x0001 | 0x0000-0x0001 | unknown function |
0x0101 | 0x0000 | 0x0000-0xFFFF | unknown function |
0x0102 | 0x0000 | 0x0000-0x0001 | unknown function |
0x0103 | 0x008d (Interesting: mine reads 0x008a - zarvox) | | All attempted values fail to set. |
0x0104 | 0x012c | 0x0000-0xFFFF | unknown function |
0x0105 | 0x0000 (mine reads 0x005a - zarvox) | 0x0000-0xFFFF | If nonzero, IR projector will cycle and pause the depth stream when too much of the IR pattern is missing from the internally-processed IR image. |
0x0106 | 0x01f4 | 0x0000-0xFFFF | unknown function |
0x0107 | 0x0bb8 | 0x0000-0xFFFF | unknown function |
0x0108 | 0x0000 | 0x0000-0x0001 | unknown function |
0x0109 | 0x002a | 0x0000-0xFFFF | unknown function |
0x010a | 0x001b | 0x0000-0xFFFF | unknown function |
0x010b | 0x0008 | 0x0000-0xFFFF | unknown function |
0x010c | 0x0003 | 0x0000-0xFFFF | unknown function |
0x010d | 0x00fa | 0x0000-0x???? | Setting to 0xFFFF crashes the command endpoint |
0x010e | 0x0004 | 0x0000-0x00FF | unknown function |
0x010f | 0x2710 | 0x0000-0xFFFF | unknown function |
0x0110 | 0x0004 | 0x0000-0xFFFF | unknown function |
0x0111 | 0x0008 | 0x0000-0xFFFF | unknown function |
0x0112 | 0x1388 | 0x0000-0xFFFF | unknown function |
0x0113 | 0x0078 | 0x0000-0xFFFF | unknown function |
0x0114 | 0x03e8 | 0x0000-0xFFFF | unknown function |
0x0115 | 0x3a98 | 0x0000-0xFFFF | unknown function |
0x0116 | 0x0064 | 0x0000-0x00FF | unknown function |
0x0117 | 0x00b7 | 0x0000-0x00FF | unknown function |
0x0118 | 0x006c | 0x0000-0x00FF | unknown function |
0x0119 | 0x00ca | 0x0000-0x00FF | unknown function |
0x011a | 0x00f5 | 0x0000-0x00FF | unknown function |
0x011b | 0x0027 | 0x0000-0x00FF | unknown function |
0x011c | 0x0005 | | unknown function |
Further parameter discovery is still a work-in-progress.
Color CMOS Camera Register Access
The USB dumps show us that the Xbox is modifying the color CMOS sensor's registers manually. The sensor is very similar to the mt9v112: [1], but has a larger sensor, and some of the registers don't align with the above document.
RGB Camera config strings
Header:
control_hdr.cmd = 0x0095;
control_hdr.tag = NONCE;
control_hdr.len = sizeof(data) / sizeof(uint16_t);
There appear to be a number of strings that affect the RGB feed that take the format:
uint16_t data[] = { RegisterCount, Address0, Value0, Address1, Value1, Address2, Value2 ... };
RegisterCount is the number of Address/Value pairs in the command. The maximum number seems to be 0xC (12).
To write a register, set the high bit (0x8000) of the address. To read a register, do not set the high bit.
Here are a couple of raw strings from the USB dump:
0c0021800080048000050380000407808e0208805f000b80460039821605578264025882e0025c8210155d82151a3b82e604
0c000280680001801c002581050005810300478130109d81ae3c5381102054814060558180a05681c0d05781e0f0588100ff
0C0005810100478130109D81AE345381102054814060558180A05681C0D05781E0F0588100FF068182742E8244102F820091
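To make the format concrete, the raw strings above can be decoded with a short Python sketch. `decode_register_string` and `encode_register_writes` are hypothetical helpers, and interpreting the dump as little-endian uint16 words is an assumption (it does reproduce the register values listed for the Xbox init sequence, such as 0x0007 = 0x028E).

```python
import struct

WRITE_FLAG = 0x8000  # high bit of the address selects a register write

FIRST_RAW_STRING = (
    "0c0021800080048000050380000407808e0208805f000b804600"
    "39821605578264025882e0025c8210155d82151a3b82e604"
)


def decode_register_string(hex_string):
    """Return a list of (address, value, is_write) tuples."""
    raw = bytes.fromhex(hex_string)
    words = struct.unpack("<%dH" % (len(raw) // 2), raw)
    count = words[0]
    pairs = []
    for i in range(count):
        addr, value = words[1 + 2 * i], words[2 + 2 * i]
        pairs.append((addr & 0x7FFF, value, bool(addr & WRITE_FLAG)))
    return pairs


def encode_register_writes(writes):
    """Build the uint16 data words for a list of (address, value) writes."""
    words = [len(writes)]
    for addr, value in writes:
        words += [addr | WRITE_FLAG, value]
    return words
```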
The camera's register values alias every 0x400 register addresses. baby-rabbit made a dump of all non-zero camera register-value pairs pasted here.
We're not 100% sure what everything is, but here's what we think we know.
Address | Default | Behavior |
---|---|---|
0x0000 | 0x148C | Sensor Core R0:0—0x000 – Chip Version (Read Only) |
0x0001 | 0x001C | Sensor Core R1:0—0x001 – Row Start. The first row to be read out (not counting dark rows that may be read). To window the image down, set this register to the starting Y value. Setting a value less than 8 is not recommended since the dark rows should be read using Reg0x022. |
0x0002 | 0x0068 | Sensor Core R2:0—0x002 – Column Start. The first column to be read out (not counting dark columns that may be read). To window the image down, set this register to the starting X value. Setting a value below 0x18 is not recommended since readout of dark columns should be controlled by Reg0x022. |
0x0003 | 0x0400(1024) | Sensor Core R3:0—0x003 - Row Count. Number of rows in the image to be read out (not counting dark rows or border rows that may be read). |
0x0004 | 0x0500(1280) | Sensor Core R4:0—0x004 - Column Count. Number of columns in image to be read out (not counting dark columns or border columns that may be read). |
0x0007 | 0x028D | Exposure control? 0x028E seems to be 33ms, and 0x0000 is about 500ms. XBox Init changes this to 0x28E. |
0x0008 | 0x005F | |
0x000B | 0x0000 | XBox Init changes this to 0x0046 |
0x0021 | 0x8000 | |
0x8105 | 0x0003 | R5:1—0x105 – Aperture Correction. Aperture correction scale factor used for sharpening. Bit 3 Enables automatic sharpness reduction control (see R51:2 0x233). Bits 2:0 Sharpening factor: 000: No sharpening. 001: 25% sharpening. 010: 50% sharpening. 011: 75% sharpening. (Default) 100: 100% sharpening. 101: 125% sharpening. 110: 150% sharpening. 111: 200% sharpening. |
0x0106 | 0x648E | R6:1—0x106 – Operating Mode Control (Read/Write). XBox Init sets this to 0x7482. This register specifies the operating mode of the IFP. Bit 15 Enables manual white balance. (Default=1) User can set the base matrix and color channel gains. This bit must be asserted and de-asserted with a frame in between to force new color correction settings to take effect. Bit 14 Enables auto exposure. (Default=1) Bit 13 Enables on-the-fly defect correction. (Default=1) Bit 12 Reserved—obsolete. The user should write a “0” to this bit. (Default=0) Bit 11 not used. - Note that this is the description from the mt9v112; the Kinect sets this bit to true in init strings, and it may do something on this variation of the part. Bit 10 Enables lens shading correction. (Default=0) Bits 9:8 Reserved. (Default=0) Bit 7 Enables automatic flicker detection. (Default=0, XBox init sequence sets to 1) Bit 6 Reserved for future expansion. (Default=0) Bit 5 Reserved. (Default=0) Bit 4 Bypasses color correction matrix. (Default=0) 0: Normal color processing. 1: Outputs “raw” color bypassing color correction. Bits 3:2 Auto exposure back light compensation control. 00: Auto exposure sampling window is specified by R38:2 and R39:2 (“large window”). (XBox init) 01: Auto exposure sampling window is specified by R43:2 and R44:2 (“small window”). (Default) 1X: Auto exposure sampling window is specified by the weighted sum of the large window Bit 1 Enables auto white balance. 0: Freezes white balance at current values. (Default) 1: Enables auto white balance. (XBox init) Bit 0 Reserved for future expansion. (Default=1) |
0x0125 | 0x0005 | R37:1—0x125 – Color Saturation Control (Read/Write). Bit 5:3 Specify overall attenuation of the color saturation. 000: Full color saturation (Default) 001: 75% of full saturation 010: 50% of full saturation 011: 37.5% of full saturation 100: 25% of full saturation 101: 150% of full saturation 110: Black and white. Bits 2:0 (inferred from the default value 0x0005) Specify the luminance at which saturation attenuation starts. 000: No attenuation. 001: Attenuation starts at luminance of 216. 010: Attenuation starts at luminance of 208. 011: Attenuation starts at luminance of 192. 100: Attenuation starts at luminance of 160. 101: Attenuation starts at luminance of 96. (Default) |
0x0147 | 0x2040 | XBox Init sets this to 0x1030 |
0x0153 | 0x0E04 | XBox Init sets this to 0x2010. RGB Gain Ramp. Elements 1,0. Controls dark pixels. |
0x0154 | 0x4C28 | XBox Init sets this to 0x6040. RGB Gain Ramp. Elements 3,2. |
0x0155 | 0x9777 | XBox Init sets this to 0xA080. RGB Gain Ramp. Elements 5,4. Controls medium brightness pixels. |
0x0156 | 0xC7B1 | XBox Init sets this to 0xD0C0. RGB Gain Ramp. Elements 7,6. |
0x0157 | 0xEEDB | XBox Init sets this to 0xF0E0. RGB Gain Ramp. Elements 9,8. |
0x0158 | 0xFF00 | XBox Init sets this to 0xFF00. RGB Gain Ramp. Element 10,11. Controls brights and 11 seems to affect bloom somehow. |
0x019D | 0x3CAE | Seen both 0x3CAE and 0x34AE set for this address |
0x022E | 0x0C44 | XBox Init sets this to 0x1044. |
0x022F | 0x9120 | XBox Init sets this to 0x9100. |
0x0239 | 0x0690 | XBox Init sets this to 0x0516. |
0x023B | 0x03DE | XBox Init sets this to 0x04E6. |
0x0257 | 0x0267 | XBox Init sets this to 0x0264. |
0x0258 | 0x02E1 | XBox Init sets this to 0x02E0. |
0x025C | 0x1610 | XBox Init sets this to 0x1510. |
0x025D | 0x1A14 | XBox Init sets this to 0x1A15. |
NUI Audio
NOTE: audio is still a work-in-progress.
The audio system is a much different beast. It starts by downloading a firmware blob, (optionally) performs cryptographic authentication (so the XBox360 can verify that it is an authentic Kinect), (also optionally) downloads impulse response data (for sound cancellation), and then bidirectionally streams audio.
As discussed on the mailing list, the firmware package can be downloaded from Microsoft here. Rene Ladan has written a python script that unpacks Xbox360 files, linked here (BSD 2-clause license). The firmware file we seek is audios.bin.
As it turns out, the firmware uploaded by the Xbox360 is completely different from the one uploaded by the Kinect SDK for Windows. The firmware used with the Xbox360 is actually unusable under Windows due to a deficiency in the Windows USB stack - Windows does not support the polling interval that the firmware requests (bInterval = 5, USB high speed, see here for more details).
The firmware used with the Kinect SDK for Windows, on the other hand, presents itself as a somewhat-standard USB Audio Class device and even has its own USB DeviceID (0x02bb). The rest of this documentation will deal with the device using the Xbox360-uploaded firmware, though the section involving the firmware upload applies equally to both.
Firmware image analysis will take place at Audio Firmware.
Firmware upload
We perform a series of USB Bulk transfers to and from the NUI Audio device, endpoint 0x01 for OUT transfers, endpoint 0x81 for IN transfers. They accomplish the following, in order:
- Check firmware version (optional)
- Send a block of the firmware (31 times, or until firmware upload is complete)
- Start execution inside the just-uploaded firmware image in memory
We will find the following structures useful:
typedef struct {
    uint32_t magic; // 0x06022009 - day after Project Natal's announcement at E3 in MMDDYYYY?
    uint32_t tag;   // reply_code.tag will match this value
    uint32_t bytes; //
    uint32_t cmd;   // Possibly a command number
    uint32_t addr;  // Argument used for addresses
    uint32_t unk;   // Always zero, as far as we've seen
} bootloader_command;
typedef struct {
    uint32_t magic;  // 0x0a6fe000
    uint32_t tag;    // should match the tag in the corresponding bootloader_command
    uint32_t status; // 0 is success
} reply_code;
Recall that USB line protocol is little-endian; thus, so are the contents of these structs.
Firmware version information
We send the following data in a bulk OUT transfer to the NUI device, endpoint 1:
09 20 02 06 01 00 00 00 60 00 00 00 00 00 00 00 15 00 00 00 00 00 00 00
We then ask the Kinect for two replies (bulk IN, endpoint 0x81). The first reply is 96 bytes long and contains numbers that appear to match 01 01 2025, which is the firmware version described in fwversions.txt found in a Microsoft update package:
01 00 01 00 00 00 00 00 44 06 00 00 01 00 01 00 00 00 00 00 E9 07 00 00 01 00 01 00 00 00 00 00 E9 07 00 00 FD FF FF FF 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
The second bulk IN transfer takes the form of the reply_code struct. reply_code.status == 0 indicates success. This "Command, Data, Result" transfer paradigm is reused often.
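The command bytes shown above can be reproduced by packing the bootloader_command struct as six little-endian 32-bit words. The helpers below are a sketch with hypothetical names (`bootloader_command`, `parse_reply_code`); the field values for the version query are read straight off the hex dump.

```python
import struct

BOOT_MAGIC = 0x06022009
REPLY_MAGIC = 0x0A6FE000


def bootloader_command(tag, nbytes, cmd, addr):
    """Pack a bootloader_command (magic, tag, bytes, cmd, addr, unk=0)."""
    return struct.pack("<6I", BOOT_MAGIC, tag, nbytes, cmd, addr, 0)


def parse_reply_code(raw):
    """Return (tag, status) from a 12-byte reply_code; status 0 is success."""
    magic, tag, status = struct.unpack("<3I", raw[:12])
    if magic != REPLY_MAGIC:
        raise ValueError("unexpected reply magic")
    return tag, status


# The firmware version query: tag 1, 0x60 reply bytes, cmd 0, addr 0x15.
version_query = bootloader_command(tag=1, nbytes=0x60, cmd=0, addr=0x15)
```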
Uploading firmware pages
We upload the firmware to the NUI Audio chip's memory (starting at 0x00080000) in 0x4000-byte blocks in the following fashion:
We start by sending a bootloader_command with cmd=3, addr = the next address (starting at 0x00080000, incrementing by 0x4000 after each block), and bytes = min(firmware bytes left to upload, 0x4000).
Before asking for a reply, we then send bootloader_command.bytes bytes of data in 512-byte bulk OUT transfers with no additional encapsulation. This is 32 bulk OUT transfers for each page other than the last.
Lastly, we request a bulk IN transfer, which should contain a reply_code struct.
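The paging scheme can be sketched as a function that splits a firmware image into blocks and 512-byte transfers. `plan_upload` is a hypothetical helper that only computes addresses and chunks; actually sending them (command, data, then reading the reply_code) is left to the USB layer.

```python
def plan_upload(firmware, base=0x00080000, page=0x4000, chunk=512):
    """Return a list of (addr, byte_count, chunks), one entry per block."""
    plan = []
    for offset in range(0, len(firmware), page):
        block = firmware[offset:offset + page]
        # Each block is sent as raw bulk OUT transfers of up to 512 bytes.
        chunks = [block[i:i + chunk] for i in range(0, len(block), chunk)]
        plan.append((base + offset, len(block), chunks))
    return plan
```

For each entry, one would send a bootloader_command with cmd=3, the given addr and byte count, then the chunks, and finally read the reply.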
Jump to new execution point
Now that we've uploaded the entire firmware image, we can tell the chip to start executing it. For audios.bin, execution should begin at offset +0x30, so we want to tell the device to jump to 0x00080030.
We send a bootloader_command with bytes=0, cmd=4, and addr=0x00080030:
09 20 02 06 22 00 00 00 00 00 00 00 04 00 00 00 30 00 08 00 00 00 00 00
Finally, we ask for its reply with a bulk IN transfer, which again should contain a reply_code struct. After we read the reply from this command, the NUI Audio device jumps to the address we specified, drops off the bus, and about a third of a second later, reenumerates.
Verifying Authenticity of the Kinect hardware
Skipping this for now. Microsoft was kind enough to make this optional, so we don't have to make the Kinect prove its authenticity.
The sequence, should we ever wish to implement it, involves control transfers on endpoint 0 and nothing else. It amounts to (approximately) a standard TLS 1.2 handshake over control transfers.
Note: performing authentication may be required to enable the noise-cancelled audio channel. Still working out the details.
Uploading CEMD data
Now we need to tell the Kinect how the speakers are positioned. This involves uploading some more data on how the Kinect should interpret differences between the microphone inputs as well as how each channel of audio output should be cancelled with each microphone input. At best guess, this data is computed by the XBox360 during calibration. Based on the transfer starting with the ASCII text "CEMD", we guess that this is calibration data used for Complex Empirical Mode Decomposition.
This paper and this one discuss Complex Empirical Mode Decomposition and its applications to "Multichannel Information Fusion."
The upload sequence is fairly similar to that of the firmware upload, with a few notable differences. First, the commands themselves are longer with a bunch of zeros at the end:
typedef struct {
    uint32_t magic;    // 0x06022009 again
    uint32_t tag;      // Will be matched in reply
    uint32_t arg1;     // initial command: 0. Firmware blocks: byte count.
    uint32_t cmd;      // Seen with values 0x0133, 0x0134, and 0x0135.
    uint32_t arg2;     // initial command: byte count. Firmware blocks: target address.
    uint32_t zeros[8]; // Always a bunch of zero bytes. How curious.
} cemdloader_command;
We can expect the same reply_code struct as was used for the firmware upload.
Again, we upload the data with three commands:
- Prepare for CEMD upload. For this command, cmd = 0x0133. arg2 = total bytes in CEMD upload (including 20-byte header).
- Upload a block of CEMD data. This uses the same structure and fields as the firmware upload described above, except that cmd = 0x134 and arg2 (addr) starts at 0x0, rather than 0x00080000. Again, note that we send the block of data before requesting a reply_code. Repeat until we have uploaded all of the CEMD data.
- Complete the CEMD upload. cmd = 0x135, arg1 = 0, arg2 = 0x64000 (the number of bytes we uploaded, less the 20-byte header).
Of note: the data appears to be somewhat periodic. After the 20-byte header, there is some structure: each of the 2k blocks (of which there are exactly 200) starts with 0x48 (72) bytes of zeros and ends with 0xf8 (248) bytes of zeros. The data appears to be a long list of little-endian 32-bit IEEE754 floats, with values between -.5 and .5. Interpreted as waveforms, they seem to form a series of impulse response patterns.
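The sizes quoted above can be cross-checked with a small Python sketch that packs the cemdloader_command struct. `cemd_command` is a hypothetical helper, and the tag values used here are arbitrary placeholders, not observed values.

```python
import struct

MAGIC = 0x06022009
HEADER_BYTES = 20   # CEMD header
BLOCK_BYTES = 2048  # "2k blocks"
BLOCK_COUNT = 200


def cemd_command(tag, arg1, cmd, arg2):
    """Pack a cemdloader_command: five words plus eight zero words."""
    return struct.pack("<13I", MAGIC, tag, arg1, cmd, arg2, *([0] * 8))


payload_bytes = BLOCK_BYTES * BLOCK_COUNT  # should equal 0x64000

# Tags below are placeholders chosen for the example.
prepare = cemd_command(tag=1, arg1=0, cmd=0x0133, arg2=HEADER_BYTES + payload_bytes)
complete = cemd_command(tag=2, arg1=0, cmd=0x0135, arg2=payload_bytes)
```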
Streaming audio
We perform ongoing isochronous transfers on endpoint 2. The general idea is this:
- You send out audio data in 76-byte OUT transfers (4 bytes header, 72 bytes 6-channel 16-bit signed PCM samples) 8 times per msec. With 6 samples per transfer, this works out to an outgoing sample rate of 48 kHz.
- A little later, you get back 524 byte IN transfers containing data tagged with the same time window.
You must perform the OUT transfers to receive any IN transfers that aren't 0 bytes. So, if you have no audio to play, send silence in all six audio channels.
Each OUT transfer consists of one audio_out_block as described below. Note that sample_51 matches the channel order of standard WAV PCM data:
typedef struct {
    int16_t left;
    int16_t right;
    int16_t center;
    int16_t lfe;
    int16_t surround_left;
    int16_t surround_right;
} sample_51;
typedef struct {
    uint16_t window;    // Appears to be a timestamp. Increments when seq hits 0x2b, 0x53, and when seq overflows
    uint8_t seq;        // Starts at 0 when window is 0 mod 3; overflows at 0x80
    uint8_t weird;      // Complex behavior, see below
    sample_51 audio[6]; // Six six-channel PCM samples per transfer
} audio_out_block;
seq increments by 1 for each transfer. window increments as described above. weird follows its own rules (TODO) but is often 0x01.
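A silent OUT transfer, as needed when there is no audio to play, could be built like this. `silent_out_block` is a hypothetical helper; defaulting weird to 0x01 simply follows the "often 0x01" observation and is not a verified rule.

```python
import struct

SAMPLES_PER_BLOCK = 6  # six sample_51 entries per transfer
CHANNELS = 6           # left, right, center, lfe, surround left/right


def silent_out_block(window, seq, weird=0x01):
    """Build one 76-byte audio_out_block containing silence."""
    header = struct.pack("<HBB", window, seq, weird)
    return header + bytes(SAMPLES_PER_BLOCK * CHANNELS * 2)  # 16-bit zeros


# 8 transfers per millisecond, 6 samples each.
samples_per_second = SAMPLES_PER_BLOCK * 8 * 1000
```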
IN transfers are a bit different. They may be 524 bytes or 60 bytes.
typedef struct {
    uint32_t magic;   // 0x80000080
    uint16_t channel; // Values between 0x01 and 0x0a. Indicates microphone channel (2-9), noise-cancelled audio (1), or other (0xa)
    uint16_t len;     // Length of transfer, in bytes. So far, we've only seen values 524 (0x020c) and 60 (0x003c)
    uint16_t window;  // Timestamp matching that of the OUT transfer that was received at the time this sample was taken
    uint16_t unknown; // No idea what these bytes represent.
    uint8_t data[];   // buffer holding (len - 12) bytes.
} audio_in_block;
The value of the channel field tells us what stream this data is part of:
- channel = 1: 16-bit signed little-endian PCM samples, unified noise-cancelled channel, 16kHz sample rate.
- channel = 2: 32-bit signed little-endian PCM samples, mic 1, 16kHz sample rate, segment 0
- channel = 3: 32-bit signed little-endian PCM samples, mic 1, 16kHz sample rate, segment 1
- channel = 4: 32-bit signed little-endian PCM samples, mic 2, 16kHz sample rate, segment 0
- channel = 5: 32-bit signed little-endian PCM samples, mic 2, 16kHz sample rate, segment 1
- channel = 6: 32-bit signed little-endian PCM samples, mic 3, 16kHz sample rate, segment 0
- channel = 7: 32-bit signed little-endian PCM samples, mic 3, 16kHz sample rate, segment 1
- channel = 8: 32-bit signed little-endian PCM samples, mic 4, 16kHz sample rate, segment 0
- channel = 9: 32-bit signed little-endian PCM samples, mic 4, 16kHz sample rate, segment 1
- channel = 0xa: unknown data
So for example, to reconstruct microphone 1's signal, we can filter audio_in_block for those with matching channel 2 or 3, sort them by window, then channel, and the data we have is a 16kHz stream of signed 32-bit PCM samples.
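The reconstruction just described can be sketched in Python as follows. `parse_in_block` and `mic_samples` are hypothetical helpers; the sketch sorts by the raw window value and therefore ignores the wrap-around of the 16-bit window counter, which a real implementation would have to handle.

```python
import struct

IN_HEADER = struct.Struct("<IHHHH")  # magic, channel, len, window, unknown
IN_MAGIC = 0x80000080


def parse_in_block(raw):
    """Split one IN transfer into channel, window and payload."""
    magic, channel, length, window, _unknown = IN_HEADER.unpack_from(raw)
    if magic != IN_MAGIC:
        raise ValueError("unexpected magic in audio_in_block")
    return {"channel": channel, "window": window,
            "data": raw[IN_HEADER.size:length]}


def mic_samples(blocks, mic):
    """Return signed 32-bit samples for microphone `mic` (1 to 4)."""
    channels = (2 * mic, 2 * mic + 1)  # segment 0 and segment 1
    picked = sorted((b for b in blocks if b["channel"] in channels),
                    key=lambda b: (b["window"], b["channel"]))
    payload = b"".join(b["data"] for b in picked)
    count = len(payload) // 4
    return list(struct.unpack("<%di" % count, payload[:count * 4]))
```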