Criware's USM format Part 1

If you have played a lot of recent Japanese games, you might have heard of Criware, or by its full name, CRI Middleware Co. Whether it is a gambling game disguised as a cutesy anime mobile game or a AAA action game where you fight God, chances are you have seen their name or logo in the start-up screen or credits. As the name CRI Middleware Co. implies, Criware provides middleware for use in video game development. Criware offers multiple middleware products for a game developer's audio and video needs, from generating lip-sync data for voice files to playing back video and audio. One of these products is Sofdec 2, whose container you might recognise by its extension, .usm. USM is a container for video and audio to be played by Criware's player, internally named Mana. You may even have encountered this file format in video games produced outside of Japan; I'll explain why later.

Why would a game developer bother with proprietary middleware for video playback? Don't Unity and Unreal Engine provide frameworks for media playback, you might ask? It is true that Sofdec 2 taps into the engine's native media framework. But Sofdec 2, together with ADX 2, Criware's audio middleware, offers the ability to incorporate audio and subtitles into video playback with ease. Sofdec 2 also allows videos to be used as textures on solids with full transparency support. There are more reasons why game developers use Criware's SDKs, but this isn't an ad for Criware and I'm not a game developer. So let's get on with dissecting this proprietary format.

This will be the first part of a series of posts about Criware's USM format. By the end of this series (or sooner), I'll have a surprise for everyone looking to extract and make their own custom USM files. I promise that I'll post more regularly compared to last year so bear with me. Lastly, I still don't know much about USM, so corrections and additional info are always welcome.

By the end of the first part, we should have learned the following:

  • The basic building block of USM, the chunk, and its general format.
  • The different types of payloads a chunk can have.
  • How dictionary payloads are encoded.

History

Before we talk about the technical details regarding CRI Movie 2, we'll first briefly discuss its history and how it gained some adoption outside of Japan.

Criware released Sofdec 2 to supersede Sofdec 1, Criware's earlier video playback library. While Sofdec 1 used the SFD container, Sofdec 2 uses a new container format called USM. USM's advantages over SFD are video seek playback support, cue point support, Unicode support, and support for more video and audio codecs. 1 Unlike SFD, metadata about the video and audio data present in the USM file is included in the initial header chunks.

In 2009, Scaleform partnered with Criware to use CRI Movie 2 for Scaleform GFx 3's Scaleform Video. 2 Scaleform, just like Criware, develops middleware for use in video games.

In 2011, Autodesk acquired Scaleform for $36 Million. 3 And in 2012, Autodesk announced their middleware suite for game developers called Autodesk Gameware. 4 Numerous game companies have used their middleware, such as CD Projekt Red and Valve, which may explain how USM has received adoption in video games produced outside of Japan.

In 2017, Autodesk removed Autodesk Gameware from their offerings and announced end of support. 5

In 2018, Criware announced a plugin for Sofdec 2 that adds support for VP9, increasing the number of supported video codecs to three. 6 7

In 2019, Criware established a subsidiary in China for the booming video game industry there. 8 Since then, Sofdec 2 and ADX 2 have been used in numerous Chinese games like Azur Lane and Genshin Impact.

The discontinuation of Autodesk's middleware suite meant the end of CRI Movie 2 adoption for the few western game developers that had embraced it. But the adoption of Criware middleware in Chinese video games makes up for it. The success of Azur Lane, Girls Frontline, Genshin Impact, and the loads of Japanese games that use their middleware proves that Criware is far from irrelevant.

Note: Sofdec 2 is the actual name of Criware's video middleware library. It is unclear to me whether CRI Movie 2 refers to the USM container format or is just Criware USA's marketing department rebranding Sofdec 2 as CRI Movie 2. However, I have to agree that Sofdec isn't a nice name, nor is it as catchy as CRI Movie. For this blog post, I'll treat USM as synonymous with CRI Movie 2, and Sofdec 2 as the entire video playback suite that includes the SDK and the USM format.


Format

Let's start the technical discussion with an overview of the building blocks of USM. A USM file is made up of chunks laid out serially, meaning that where one chunk ends, the next one begins. It is also important to note that all data stored in a USM file is big-endian: the most significant byte is stored first. The format of a USM chunk is:


Chunk header

Chunk identifier

The first four bytes of a chunk header are its identifier. A chunk identifier indicates the chunk type, whether audio or video related or something else entirely. The identifier is four letters of ASCII text, and in total, there are three chunk identifiers for USM:

  • CRID
  • @SFV
  • @SFA

A chunk with an identifier of CRID contains information on all the video and audio streams available in the USM file. It also includes information on the format version of the USM file itself. A CRID chunk only appears at the beginning of the USM file and will only have a header payload type.

A chunk with the identifier @SFV or @SFA contains information on a video or audio stream, respectively. It could hold a header, metadata, or an actual frame packet, depending on the chunk's payload type.

Chunk size

Chunk size is the size of the chunk data (payload header, payload, and padding) and does not include the 8-byte chunk header.
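To make the identifier and size fields a bit more concrete, here is a minimal Python sketch of walking a file chunk by chunk. It assumes the whole file is already in memory and that the chunk size is a 4-byte unsigned integer (the 8-byte header minus the 4-byte identifier). The function name iter_chunks and the sample bytes are made up for illustration and are not real USM content.

```python
import struct

def iter_chunks(data: bytes):
    """Yield (identifier, chunk_data) pairs from a USM-like byte string."""
    pos = 0
    while pos + 8 <= len(data):
        # 4-byte ASCII identifier followed by a big-endian 4-byte size.
        ident, size = struct.unpack_from(">4sI", data, pos)
        start = pos + 8       # chunk data begins right after the 8-byte header
        end = start + size    # the size does not count the header itself
        if end > len(data):
            raise ValueError("chunk at offset %d runs past the end of the data" % pos)
        yield ident.decode("ascii"), data[start:end]
        pos = end             # the next chunk starts where this one ends

# Two made-up chunks, only to exercise the loop (hypothetical content).
sample = (b"CRID" + struct.pack(">I", 4) + b"\x00" * 4
          + b"@SFV" + struct.pack(">I", 2) + b"\xAA\xBB")
chunks = list(iter_chunks(sample))
```

Because the size excludes the header, the reader always has to skip 8 bytes plus the stored size to reach the next identifier.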


Chunk data

Payload offset

As the name suggests, this is the offset from the start of the chunk data to the payload. From existing USM files, this is always 0x18.

Padding size

Padding size is the number of padding bytes appended at the end of the payload.
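Taken together with the chunk size, these two fields are enough to locate the payload inside the chunk data. The sketch below only does the arithmetic; the function name payload_bounds and the sample numbers are hypothetical, and it assumes the three values have already been read from the chunk.

```python
def payload_bounds(chunk_size: int, payload_offset: int, padding_size: int):
    """Return (start, end) of the payload, relative to the start of the chunk data."""
    start = payload_offset            # payload begins after the payload header
    end = chunk_size - padding_size   # padding sits at the very end of the chunk data
    if not 0 <= start <= end:
        raise ValueError("inconsistent chunk size, payload offset, or padding size")
    return start, end

# Made-up numbers: a 0x100-byte chunk with the usual 0x18 offset and 8 padding bytes.
start, end = payload_bounds(chunk_size=0x100, payload_offset=0x18, padding_size=0x08)
payload_length = end - start
```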

Channel number

The channel number of a chunk is a 1-byte integer that begins at 0. Typical use is for USMs with one video track and multiple audio tracks for localisation.

Channel numbers are not exclusive across track types: a video track and an audio track can have the same channel number. However, two audio tracks will never share the same channel number. A CRID chunk will always have a channel number of 0.

Payload type

Payload type is an enum type packed into one byte and can be represented as:

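enum payload_type {
    stream = 0,      /* video or audio data */
    header = 1,      /* track metadata */
    section_end = 2, /* end-of-section marker */
    seek = 3,        /* video seek positions */
};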

A stream payload is binary data from a video or audio stream. A header payload contains media metadata about a video or audio track, while a seek payload contains data about the seek positions of a video track. A section_end payload marks the end of a stream, header, or seek chunk or series of chunks.

Frame time

A frame time is a 4-byte integer used to synchronise the chunks of audio and video frames. It is only used for stream chunks and is 0 for everything else.

Frame rate

A chunk's frame rate is a 4-byte integer whose value depends on the chunk type and, for video, on the track's actual frame rate. For chunks that are not stream chunks, the value is always 30, and for audio stream chunks, the value is always 2997. For video stream chunks, the value is 100 times the video track's frame rate.

From actual USM files, table 1 contains a list of typical frame rates and their corresponding stream chunk's frame rate:

Video frame rate    Stream chunk frame rate
24                  2400
29.97               2997
30                  3000
60                  6000

Table 1: Common video frame rates and their corresponding stream chunk framerates.
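The "100 times" rule can be checked against table 1 with a few lines of Python. This is only a sketch: the helper name stream_chunk_frame_rate is made up, and going through Fraction is simply my way of avoiding floating-point surprises with values like 29.97.

```python
from fractions import Fraction

def stream_chunk_frame_rate(video_fps) -> int:
    """Return the value stored in a video stream chunk: the frame rate times 100."""
    # Converting through str() keeps 29.97 exact instead of 29.969999...
    return round(Fraction(str(video_fps)) * 100)

table_1 = {fps: stream_chunk_frame_rate(fps) for fps in (24, 29.97, 30, 60)}
```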

Payload

Payload is the chunk's actual data; it could be metadata or packets of data from a video or audio track. For stream payloads, the payload is just the bytes from a frame of a video track or a packet of an audio track.

For other payload types, which we'll refer to as dictionary payloads, the payload is just information presented in an array of dictionaries with key strings. An example of this for video seek information is:

video_seek = [
    {
        "ofs_byte": 5696,
        "ofs_frmid": 0,
        "num_skip": 0,
        ...
    },
    {
        "ofs_byte": 569632,
        "ofs_frmid": 60,
        "num_skip": 0,
        ...
    },
    {
        "ofs_byte": 3864416,
        "ofs_frmid": 120,
        "num_skip": 0,
        ...
    },
    ...
]

I excluded some information for brevity, but the vital thing to note is that every dictionary in the array has the same keys. All values of the same key have the same type. The types for a dictionary value as C types are:

  • Char (1 byte)
  • Unsigned char (1 byte)
  • Short (2 bytes)
  • Unsigned short (2 bytes)
  • Integer (4 bytes)
  • Unsigned integer (4 bytes)
  • Long long (8 bytes)
  • Unsigned long long (8 bytes)
  • Float (4 bytes)
  • Double (8 bytes)
  • String (variable and null-terminated)
  • Byte array (variable)

Padding

Padding is just null bytes (0x00) whose size is declared in the payload header's padding size field. It is important to note that the USM format was designed for CDs; padding is therefore essential for aligning parts of the file to sector boundaries for more efficient reads.


Payload encoding

Let's elaborate more on how dictionary payloads encode their data. To reiterate what I've written in the previous section with additional information:

  • Dictionary payload contains an array of dictionaries.
  • Each dictionary payload has a name.
  • All dictionaries in the array have the same set of keys.
  • All dictionaries in the array have ASCII string keys.
  • All dictionaries in the array have the same order of keys.
  • All values of the same key have the same type.
  • The same key's value may or may not differ from the others.
  • A value with a string type can be encoded in either Shift-JIS, UTF-8, or UTF-16.

There are four arrays in a dictionary payload. They are:

  • An array for shared data.
  • An array for unique data.
  • An array for C-strings.
  • And an array for byte arrays.

Before I describe what the four arrays contain, I'll give a simple example. The first part of the example is a Python code snippet that gives a high-level view of the contents and structure of the payload. Next are the equivalent bytes of each array (presented in hex), as if the snippet were converted to an actual payload.

Note: The following example is for demonstration purposes and is not used in any actual USM file.

Sample payload

payload.name = "Example payload"
payload.dicts = [
    {
        "filename": (ValueType.string, "foo.txt"),
        "filesize": (ValueType.int, 12345678),
        "version": (ValueType.char, 1),
        "owner": (ValueType.string, "donmai")
    },
    {
        "filename": (ValueType.string, "bar.txt"),
        "filesize": (ValueType.int, 87654321),
        "version": (ValueType.char, 1),
        "owner": (ValueType.string, "donmai")
    },
]

Shared array

Shared array is 25 bytes.

5A 00 00 00 17 54 00 00 00 20 30 00 00 00 29 01
3A 00 00 00 31 00 00 00 3F

Unique array

Unique array is 16 bytes.

00 00 00 37 00 BC 61 4E 00 00 00 46 05 39 7F B1

C-string array

C-string array is 78 bytes. A \x00 denotes a null-byte.

<NULL>\x00Example payload\x00filename\x00filesize\x00version\x00owner\x00foo.txt\x00donmai\x00bar.txt\x00

Byte array

Byte array is empty.

Explanation

Let's look at this byte by byte, starting with the shared array. The bytes in the shared array are grouped, and the number of groups equals the number of keys in a dictionary. Our example's dictionary has four keys, so there are four groups in the shared array. The first byte in a group contains two pieces of information: the value type and whether the value is unique or recurring. From our example, we know that the first value, "foo.txt", is a string and unique. To pack these two pieces of information into one byte, we first convert the value's type to a number using table 2. Next, we convert the value's occurrence to a number using table 3. Finally, we add the value type's number to the value occurrence's number shifted 5 bits to the left. For our first value, we get 0x1A for the value type and 2 for the value occurrence. We then combine them like this: 0x1A + (2 << 5) = 0x5A.

Value type            Number    Size (bytes)
Char                  0x10      1
Unsigned char         0x11      1
Short                 0x12      2
Unsigned short        0x13      2
Integer               0x14      4
Unsigned integer      0x15      4
Long long             0x16      8
Unsigned long long    0x17      8
Float                 0x18      4
Double                0x19      8
String                0x1A      4 (one pointer)
Bytes                 0x1B      8 (4-byte start and end pointers)

Table 2: Conversion table for a value's type and its corresponding number.

Value occurrence                Number
Recurring                       1
At least one value is unique    2

Table 3: Conversion table for a value's occurrence and its corresponding number.
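The packing of the first byte can be sketched in Python as follows, with the occurrence number shifted left by 5 bits and added to the type number. The function names are made up, and unpacking by masking the lower five bits is my own assumption; it works because every type number in table 2 fits in five bits.

```python
STRING, INTEGER, CHAR = 0x1A, 0x14, 0x10   # type numbers from table 2
RECURRING, UNIQUE = 1, 2                   # occurrence numbers from table 3

def pack_flags(value_type: int, occurrence: int) -> int:
    """Combine a type number and an occurrence number into one byte."""
    return value_type + (occurrence << 5)

def unpack_flags(flags: int):
    """Split the first byte of a group back into (value_type, occurrence)."""
    # Assumption: the type lives in the low five bits, the occurrence above them.
    return flags & 0x1F, flags >> 5

first_bytes = [
    pack_flags(STRING, UNIQUE),      # filename
    pack_flags(INTEGER, UNIQUE),     # filesize
    pack_flags(CHAR, RECURRING),     # version
    pack_flags(STRING, RECURRING),   # owner
]
```

The four results line up with the first byte of each group in the sample shared array: 5A, 54, 30, and 3A.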

After the first byte, the following four bytes are the start offset of the key in the C-string array. 00 00 00 17 is 23 and points to the start of filename\x00filesize\x00ver.... Since the string is null-terminated, our key with the null byte discarded is filename, which is indeed the key for the first item in the dictionary. Finally, since our value is unique, we find the value in the unique array. Since our value's type is a string, the pointer size is 4 bytes. Taking the first four bytes of the unique array, we get 00 00 00 37. This pointer points to the start of foo.txt\x00donmai\x00..., which means our value is foo.txt. Having retrieved our value, we are done with the first group of bytes in the shared array.
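Following an offset into the C-string array is simple enough to sketch in a few lines of Python. The helper name read_cstring is hypothetical, and decoding as ASCII is an assumption that only holds for this sample; real values may use the other encodings listed earlier.

```python
# The sample C-string array from above, 78 bytes in total.
C_STRINGS = (b"<NULL>\x00Example payload\x00filename\x00filesize\x00version\x00"
             b"owner\x00foo.txt\x00donmai\x00bar.txt\x00")

def read_cstring(strings: bytes, offset: int, encoding: str = "ascii") -> str:
    """Return the null-terminated string that starts at the given offset."""
    end = strings.index(b"\x00", offset)   # position of the terminating null byte
    return strings[offset:end].decode(encoding)

# Key offset taken from the shared array, value pointer from the unique array.
key = read_cstring(C_STRINGS, int.from_bytes(bytes.fromhex("00000017"), "big"))
value = read_cstring(C_STRINGS, int.from_bytes(bytes.fromhex("00000037"), "big"))
```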

Let's move on to the second group, and let's make it brief:

  1. Shared array second group: first byte = 0x54 = 0x14 + (2 << 5). The second value is an integer and unique.
  2. Next four bytes in the shared array: 00 00 00 20 => filesize\x00. The second key is filesize.
  3. Since the value is unique and an integer, the next four bytes in the unique array are 00 BC 61 4E. The second value is 12345678.

Third group:

  1. Shared array third group: first byte = 0x30 = 0x10 + (1 << 5). The third value is a char and recurring.
  2. Next four bytes in the shared array: 00 00 00 29 => version\x00. The third key is version.
  3. Since the value is recurring and a char, the next byte in the shared array is 01. The third value is 1.

Final group:

  1. Shared array fourth group: first byte = 0x3A = 0x1A + (1 << 5). The fourth value is a string and recurring.
  2. Next four bytes in the shared array: 00 00 00 31 => owner\x00. The fourth key is owner.
  3. Since the value is recurring and a string, the next four bytes in the shared array are 00 00 00 3F => donmai\x00. The fourth value is donmai.

We would have the following dictionary discarding value type and occurrence:

{
    "filename": "foo.txt",
    "filesize": 12345678,
    "version": 1,
    "owner": "donmai",
}

Which is precisely the dictionary we made in the example.

Now that we're done with the first dictionary in the array, let's move on to the second and final dictionary. To do that, we point back to the start of the shared array and keep our position in the unique array. Effectively, the shared and unique arrays are now equivalent to this:

  • Shared array: 5A 00 00 00 17 54 00 00 00 20 30 00 00 00 29 01 3A 00 00 00 31 00 00 00 3F
  • Unique array: 00 00 00 46 05 39 7F B1

The C-string and byte arrays are still the same since the offsets stored in the shared and unique arrays are absolute. Now let's do what we did before, but this time for the second dictionary.

First group:

  1. Shared array first group: first byte = 0x5A = 0x1A + (2 << 5). The first value is a string and unique.
  2. Next four bytes in the shared array: 00 00 00 17 => filename\x00. The first key is filename.
  3. Since the value is unique and a string, the next four bytes in the unique array are 00 00 00 46 => bar.txt\x00. The first value is bar.txt.

Second group:

  1. Shared array second group: first byte = 0x54 = 0x14 + (2 << 5). The second value is an integer and unique.
  2. Next four bytes in the shared array: 00 00 00 20 => filesize\x00. The second key is filesize.
  3. Since the value is unique and an integer, the next four bytes in the unique array are 05 39 7F B1. The second value is 87654321.

Third group:

  1. Shared array third group: first byte = 0x30 = 0x10 + (1 << 5). The third value is a char and recurring.
  2. Next four bytes in the shared array: 00 00 00 29 => version\x00. The third key is version.
  3. Since the value is recurring and a char, the next byte in the shared array is 01. The third value is 1.

Final group:

  1. Shared array fourth group: first byte = 0x3A = 0x1A + (1 << 5). The fourth value is a string and recurring.
  2. Next four bytes in the shared array: 00 00 00 31 => owner\x00. The fourth key is owner.
  3. Since the value is recurring and a string, the next four bytes in the shared array are 00 00 00 3F => donmai\x00. The fourth value is donmai.

From the procedures we've done we derive this dictionary, again, discarding value types and occurrence:

{
    "filename": "bar.txt",
    "filesize": 87654321,
    "version": 1,
    "owner": "donmai",
}

Which is the same as the second dictionary in our example. To summarize the process:

  • Each item in a dictionary is represented as a group in the shared array.
  • The first byte is: T + (O << 5), where T is the number corresponding to the value's type and O is the value's occurrence number.
  • The next four bytes are the offset of the key in the C-string array.
  • If the value is recurring - meaning all the values for the same key across all dictionaries are the same - the value or its pointer(s) are stored in the shared array.
  • Otherwise, the value or its pointer(s) are stored in the unique array.
  • For strings, a pointer is stored in either the shared or unique array. This pointer points to the start of the string in the C-string array.
  • For bytes, two pointers are stored in either the shared or unique array: the start and end pointers, respectively.
  • When moving to the following dictionary in the array, we point back to the start of the shared array and retain where we point to the unique array.
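The whole procedure summarised above can be sketched as a small Python decoder. This is an illustration built only from the rules in this post, not Criware's implementation: the function names are made up, keys and strings are decoded as ASCII for simplicity, and the type is assumed to sit in the low five bits of the first byte.

```python
import struct

# Table 2 numbers mapped to struct format characters (big-endian is applied below).
FORMATS = {0x10: "b", 0x11: "B", 0x12: "h", 0x13: "H", 0x14: "i", 0x15: "I",
           0x16: "q", 0x17: "Q", 0x18: "f", 0x19: "d"}
STRING, BYTES, RECURRING = 0x1A, 0x1B, 1

def cstring(strings, offset):
    end = strings.index(b"\x00", offset)
    return strings[offset:end].decode("ascii")   # assumption: ASCII only

def read_value(value_type, buf, pos, strings, blobs):
    """Read one value from buf at pos and return (value, new_pos)."""
    if value_type == STRING:
        (offset,) = struct.unpack_from(">I", buf, pos)
        return cstring(strings, offset), pos + 4
    if value_type == BYTES:
        start, end = struct.unpack_from(">II", buf, pos)
        return blobs[start:end], pos + 8
    fmt = ">" + FORMATS[value_type]
    (value,) = struct.unpack_from(fmt, buf, pos)
    return value, pos + struct.calcsize(fmt)

def decode_dicts(shared, unique, strings, blobs, num_items, num_dicts):
    result, upos = [], 0
    for _ in range(num_dicts):
        entry, spos = {}, 0   # rewind the shared array, keep the unique position
        for _ in range(num_items):
            flags = shared[spos]
            value_type, occurrence = flags & 0x1F, flags >> 5
            (key_offset,) = struct.unpack_from(">I", shared, spos + 1)
            spos += 5
            if occurrence == RECURRING:
                value, spos = read_value(value_type, shared, spos, strings, blobs)
            else:
                value, upos = read_value(value_type, unique, upos, strings, blobs)
            entry[cstring(strings, key_offset)] = value
        result.append(entry)
    return result

shared = bytes.fromhex("5A00000017 5400000020 3000000029 01 3A00000031 0000003F")
unique = bytes.fromhex("00000037 00BC614E 00000046 05397FB1")
strings = (b"<NULL>\x00Example payload\x00filename\x00filesize\x00version\x00"
           b"owner\x00foo.txt\x00donmai\x00bar.txt\x00")
dicts = decode_dicts(shared, unique, strings, b"", num_items=4, num_dicts=2)
```

Running it over the sample arrays yields the same two dictionaries we derived by hand.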

I hope this gave you a clear understanding of how USM encodes an array of dictionaries. Now let's move on to how to put everything together. A dictionary payload is structured as follows:

  • Header
    • Identifier (4 bytes)
    • Payload size (4 bytes)
  • Data
    • Unique array offset (4 bytes)
    • C-string array offset (4 bytes)
    • Byte array offset (4 bytes)
    • Payload name offset (4 bytes)
    • Number of items per dictionary (2 bytes)
    • Unique array size per dictionary (2 bytes)
    • Number of dictionaries (4 bytes)
    • Shared array
    • Unique array
    • C-string array
    • Byte array

The header of the payload is 8 bytes and is composed of two fields. The first is the identifier, a four-letter ASCII string with the value @UTF. The second is the payload size, a 4-byte integer that states the size of the actual data, not including this 8-byte header.

The unique, C-string, and byte array offsets are 4-byte integers that point to the start of their respective arrays, relative to the end of the 8-byte header. Next, the payload name offset points to the start of the payload name in the C-string array. The number of items per dictionary and the number of dictionaries are self-explanatory. Finally, the unique array size per dictionary is the number of bytes each dictionary consumes in the unique array.
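As a sanity check on these size fields, the sketch below derives the shared array size and the unique array size per dictionary from nothing but each key's type and occurrence, using the sizes in table 2. The function name array_sizes and the column list are my own illustration, modelled on the sample payload.

```python
# Value sizes in bytes from table 2 (a string stores one 4-byte pointer,
# a byte array stores a 4-byte start pointer and a 4-byte end pointer).
SIZES = {0x10: 1, 0x11: 1, 0x12: 2, 0x13: 2, 0x14: 4, 0x15: 4,
         0x16: 8, 0x17: 8, 0x18: 4, 0x19: 8, 0x1A: 4, 0x1B: 8}
RECURRING, UNIQUE = 1, 2

def array_sizes(columns, num_dicts):
    """columns holds one (value_type, occurrence) pair per dictionary key."""
    shared = unique_per_dict = 0
    for value_type, occurrence in columns:
        shared += 1 + 4                        # first byte + key offset
        if occurrence == RECURRING:
            shared += SIZES[value_type]        # recurring values live in the shared array
        else:
            unique_per_dict += SIZES[value_type]
    return shared, unique_per_dict, unique_per_dict * num_dicts

# filename, filesize, version, owner from the sample payload.
columns = [(0x1A, UNIQUE), (0x14, UNIQUE), (0x10, RECURRING), (0x1A, RECURRING)]
shared_size, unique_per_dict, unique_total = array_sizes(columns, num_dicts=2)
```

For the sample this gives a 25-byte shared array, 8 unique bytes per dictionary, and a 16-byte unique array, matching the sizes quoted earlier.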

For our example above the payload is:

  • Header
    • Identifier: 40 55 54 46
    • Payload size: 00 00 00 8F
  • Data
    • Unique array offset: 00 00 00 31
    • C-string array offset: 00 00 00 41
    • Byte array offset: 00 00 00 8F
    • Payload name offset: 00 00 00 07
    • Number of items per dictionary: 00 04
    • Unique array size per dictionary: 00 08
    • Number of dictionaries: 00 00 00 02
    • Shared array: 5A 00 00 00 17 54 00 00 00 20 30 00 00 00 29 01 3A 00 00 00 31 00 00 00 3F
    • Unique array: 00 00 00 37 00 BC 61 4E 00 00 00 46 05 39 7F B1
    • C-string array: 3C 4E 55 4C 4C 3E 00 45 78 61 6D 70 6C 65 20 70 61 79 6C 6F 61 64 00 66 69 6C 65 6E 61 6D 65 00 66 69 6C 65 73 69 7A 65 00 76 65 72 73 69 6F 6E 00 6F 77 6E 65 72 00 66 6F 6F 2E 74 78 74 00 64 6F 6E 6D 61 69 00 62 61 72 2E 74 78 74 00
    • Byte array: NONE

Putting everything in one contiguous block of bytes would produce our actual payload:

40 55 54 46 00 00 00 8F 00 00 00 31 00 00 00 41
00 00 00 8F 00 00 00 07 00 04 00 08 00 00 00 02
5A 00 00 00 17 54 00 00 00 20 30 00 00 00 29 01
3A 00 00 00 31 00 00 00 3F 00 00 00 37 00 BC 61
4E 00 00 00 46 05 39 7F B1 3C 4E 55 4C 4C 3E 00
45 78 61 6D 70 6C 65 20 70 61 79 6C 6F 61 64 00
66 69 6C 65 6E 61 6D 65 00 66 69 6C 65 73 69 7A
65 00 76 65 72 73 69 6F 6E 00 6F 77 6E 65 72 00
66 6F 6F 2E 74 78 74 00 64 6F 6E 6D 61 69 00 62
61 72 2E 74 78 74 00
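Going the other way, the block above can be split back into its parts using only the header fields. The following Python is a rough sketch under the layout described in this post; the function name split_payload and the dictionary keys it returns are hypothetical.

```python
import struct

DUMP = """
40 55 54 46 00 00 00 8F 00 00 00 31 00 00 00 41
00 00 00 8F 00 00 00 07 00 04 00 08 00 00 00 02
5A 00 00 00 17 54 00 00 00 20 30 00 00 00 29 01
3A 00 00 00 31 00 00 00 3F 00 00 00 37 00 BC 61
4E 00 00 00 46 05 39 7F B1 3C 4E 55 4C 4C 3E 00
45 78 61 6D 70 6C 65 20 70 61 79 6C 6F 61 64 00
66 69 6C 65 6E 61 6D 65 00 66 69 6C 65 73 69 7A
65 00 76 65 72 73 69 6F 6E 00 6F 77 6E 65 72 00
66 6F 6F 2E 74 78 74 00 64 6F 6E 6D 61 69 00 62
61 72 2E 74 78 74 00
"""
PAYLOAD = bytes.fromhex("".join(DUMP.split()))

HEADER_SIZE = 8    # identifier + payload size
FIELDS_SIZE = 24   # the seven fixed fields that precede the shared array

def split_payload(payload: bytes) -> dict:
    ident, size = struct.unpack_from(">4sI", payload, 0)
    if ident != b"@UTF":
        raise ValueError("not a dictionary payload")
    # All offsets below are relative to the end of the 8-byte header.
    data = payload[HEADER_SIZE:HEADER_SIZE + size]
    (unique_off, strings_off, bytes_off, name_off,
     num_items, unique_size, num_dicts) = struct.unpack_from(">IIIIHHI", data, 0)
    strings = data[strings_off:bytes_off]
    name_end = strings.index(b"\x00", name_off)
    return {
        "name": strings[name_off:name_end].decode("ascii"),
        "num_items": num_items,
        "unique_size": unique_size,
        "num_dicts": num_dicts,
        "shared": data[FIELDS_SIZE:unique_off],
        "unique": data[unique_off:strings_off],
        "strings": strings,
        "bytes": data[bytes_off:],
    }

parts = split_payload(PAYLOAD)
```

The shared array has no offset field of its own; it simply starts right after the fixed fields and ends where the unique array begins.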

Conclusion

If you are with me up to this point, we should have a sufficient understanding of the building blocks of a USM file and how they are encoded. In the next part, I'll discuss the different types of chunks used in an actual USM file, their purpose and how they are structured. I'll try my best to post the following parts of this as soon as possible. I was initially going to include everything in one long post but it took too much time, and external factors forced me to post at least a part of this as soon as possible. Thank you for your time, and see you again soon.

