Opera Mini OBML file format

This documentation is released under CC BY 4.0.

OBML (Opera Binary Markup Language) files are self-contained, rendered versions of HTML documents generated by the Presto v2 engine. They are static, containing pixel-positioned regions adapted for a specific device's screen size & font metrics. (Thus OBML documents generated for one device tend to look slightly 'off' everywhere else, and a perfect rendering is impossible without knowing the original device.)

Several OBML versions have been used over time, and each Opera Mini version is compatible with only one of them, so an upgrade might leave old saved pages unreadable. (The OBML version in use can be seen by visiting debug:.)

Most saved pages use OBML v12, v13, v15, or v16; I haven't investigated the format used by earlier "modded" Opera Mini versions which had this feature added unofficially.

Basic data types

OBML uses these primitive types:

byte – unsigned integer (1 byte)
short – unsigned integer (2 bytes, big-endian)
medium – unsigned integer (3 bytes, big-endian)
blob – consists of a short indicating the length, followed by that many bytes of data
char – a byte containing an ASCII character
string – a blob containing UTF-8 encoded text
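
The primitive types above map onto a handful of small reader functions. The following is a minimal sketch, not a reference implementation; the function names (read_exact, read_uint and so on) are made up for illustration and assume a binary file-like object.

```python
import io

def read_exact(f, n):
    """Read exactly n bytes, raising EOFError if the data is truncated."""
    data = f.read(n)
    if len(data) != n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data

def read_uint(f, n):
    """Big-endian unsigned integer stored in n bytes."""
    return int.from_bytes(read_exact(f, n), "big")

def read_byte(f):
    return read_uint(f, 1)

def read_short(f):
    return read_uint(f, 2)

def read_medium(f):
    return read_uint(f, 3)

def read_blob(f):
    """A short giving the length, followed by that many raw bytes."""
    return read_exact(f, read_short(f))

def read_string(f):
    """A blob holding UTF-8 text."""
    return read_blob(f).decode("utf-8")

# Usage: a string is a 2-byte length followed by the UTF-8 bytes.
stream = io.BytesIO(b"\x00\x05hello")
print(read_string(stream))  # hello
```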

Other data types

url ← string

Each OBML file has a "base URL", essentially a reusable prefix. When another URL starts with a null byte, that byte is to be replaced with the base URL. For example, if the base is http://example.com/dir and a URL value is \x00/index.html, it expands to http://example.com/dir/index.html.
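
The expansion rule can be sketched as below. The function name expand_url is hypothetical, and the sketch assumes the URL has already been decoded from its string field.

```python
def expand_url(url, base):
    """Replace a leading null byte with the file's base URL."""
    if url.startswith("\x00"):
        return base + url[1:]
    return url

print(expand_url("\x00/index.html", "http://example.com/dir"))
# http://example.com/dir/index.html
```

URLs that do not start with a null byte are returned unchanged.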

color ← (a: byte, r: byte, g: byte, b: byte)

Colors are stored as ARGB tuples, with one byte (0–255) per component.

coords ← (x: short, y: medium)

Coordinates are stored as a short for the X position (0–65535) followed by a medium for the Y position (0–16777215). The origin (0, 0) is in the top-left corner.

In format versions ≤ 13, "position" coordinates are absolute and stored as-is.

In format version ≥ 15, "position" coordinates are stored as relative to the last position coordinate. Relative coordinates wrap around, thus negative offsets are stored as very large positive coordinates. For example, (-4, -2) is stored as (0xFFFC, 0xFFFFFE).

(Note that positions are relative only to the last position – absolute coordinates like sizes do not affect the relative offset in any way.)
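
The wrap-around arithmetic can be sketched as follows. PositionTracker and signed_offset are illustrative names, not part of any existing tool. Two points are assumptions rather than facts from the format: the running position is taken to start at (0, 0), and any version below 15 is treated as absolute (v14 has not been examined here).

```python
def signed_offset(raw, bits):
    """Interpret a wrapped unsigned value as a signed offset."""
    return raw - (1 << bits) if raw >= 1 << (bits - 1) else raw

class PositionTracker:
    """Turns stored "position" coordinates into absolute ones."""

    def __init__(self, version):
        self.relative = version >= 15  # assumption: v14 treated like v13
        self.x = 0                     # assumption: starts at the origin
        self.y = 0

    def resolve(self, raw_x, raw_y):
        if not self.relative:
            return raw_x, raw_y
        # X is a short (16 bits), Y is a medium (24 bits); both wrap around.
        self.x = (self.x + raw_x) & 0xFFFF
        self.y = (self.y + raw_y) & 0xFFFFFF
        return self.x, self.y

tracker = PositionTracker(16)
print(tracker.resolve(10, 20))            # (10, 20)
print(tracker.resolve(0xFFFC, 0xFFFFFE))  # (6, 18), i.e. moved by (-4, -2)
```

Only position fields should be passed through resolve; sizes are read as-is and leave the running position untouched.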

Header

The file starts with a file_size: medium followed by version: byte.

In v≥15, file_size is always 0x02d355 and version is always 16; they're followed by a second identical header containing the real values. The reason for that is unknown.

Note that file_size only includes the bytes following it. It doesn't include the field's own size, nor the preceding fields.
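
A sketch of reading the start of the header, including the dummy first header in v≥15, might look like this. The name read_header_start is hypothetical, and the expected_end value reflects one reading of the note above (the size counts every byte after the file_size field itself).

```python
import io

def read_header_start(f):
    """Return (file_size, version, expected_end) from the start of a file."""
    file_size = int.from_bytes(f.read(3), "big")
    size_end = f.tell()
    version = f.read(1)[0]
    if file_size == 0x02D355 and version == 16:
        # v>=15: the first header is a fixed dummy; the real one follows.
        file_size = int.from_bytes(f.read(3), "big")
        size_end = f.tell()
        version = f.read(1)[0]
    # Assumption: file_size counts the bytes after the (real) size field.
    return file_size, version, size_end + file_size

# Usage with a made-up v15 header: dummy header, then the real values.
sample = io.BytesIO(b"\x02\xd3\x55\x10" + b"\x00\x01\x00\x0f")
print(read_header_start(sample))  # (256, 15, 263)
```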

Following is page_size: coords.

In v16, following is unknown: bytes[3] (always S\x00\x00).

Following is unknown: short (always 0xFFFF).

Following are page_title: string, unknown: blob, page_url_base: string, and page_url: url. The unknown blob seems to always start with C\x10\x10... in v15 and to be empty otherwise.

Following is an unknown header (6 bytes for v≥15, 5 bytes for v≤13).

In v13, the format appears to be counter: short, unknown: medium.

Following is the "metadata" section and the "content" section, both composed of tagged chunks.

Metadata section

This section consists of several chunks. Each chunk starts with type: char (an ASCII letter), followed by a variable number of fields.

Metadata: 'C' chunks

In v≥15, always contain byte[23] of unknown data.

Metadata: 'M' chunks

Always contain subtype: char, unknown: byte (always 0x00), data: blob.

'S' sub-type

Secure connection (TLS) information. Contains byte[6], cert_expiry: string, secure_status: string, tls_details: string, cert_common_name: string.

Metadata: 'S' chunks

Appear to contain links_size: medium followed by a "links" sub-section of that size.

Links sub-section

This section consists of chunks and appears to be a sub-section of the preceding 'S' chunk.

Links: '\x00' chunks

Always contain unknown: byte and count: byte, followed by that many (id: string, label: string) pairs. Each of these chunks seems to store the <option> choices for an HTML <select> widget.
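
Reading such a chunk can be sketched as below. The helper names are illustrative, and the sketch assumes the '\x00' type byte has already been consumed by the caller.

```python
import io

def read_uint(f, n):
    data = f.read(n)
    if len(data) != n:
        raise EOFError("truncated OBML data")
    return int.from_bytes(data, "big")

def read_string(f):
    length = read_uint(f, 2)
    data = f.read(length)
    if len(data) != length:
        raise EOFError("truncated OBML string")
    return data.decode("utf-8")

def read_options_chunk(f):
    """Return (unknown_byte, [(id, label), ...]) for a '\\x00' links chunk."""
    unknown = read_uint(f, 1)
    count = read_uint(f, 1)
    options = []
    for _ in range(count):
        option_id = read_string(f)
        label = read_string(f)
        options.append((option_id, label))
    return unknown, options

# Usage with made-up data: two options, "1" -> "One" and "2" -> "Two".
body = b"\x00\x02" + b"\x00\x011" + b"\x00\x03One" + b"\x00\x012" + b"\x00\x03Two"
print(read_options_chunk(io.BytesIO(body)))
```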

Links: all region chunks

Most other chunks in this sub-section define 'regions' and share the same data format.

In v≥15 the format is box_count: byte, followed by that many (pos: coords[rel], size: coords) coordinate pairs, followed by link_target: blob, unknown: byte[2] (always \x01\x74), link_type: blob.

In v13 the format is box_count: byte, followed by that many (pos: coords, size: coords) coordinate pairs, followed by link_target: blob, unknown: byte[2], link_type: blob.

In v12 the format is box_count: byte, followed by that many (pos: coords, size: coords) pairs, followed by link_type: blob, link_target: blob.

link_type, if non-empty, seems to be a string with the MIME type.

link_target can be an url, a string, or an unknown blob.
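
The three region layouts can be handled by one reader, sketched below. All names are hypothetical. The sketch assumes the chunk's type byte has already been read, that the caller carries the last position across chunks (whether the links sub-section shares its running position with the content section has not been verified), and that v14, which is not described here, behaves like v13.

```python
import io

def read_uint(f, n):
    data = f.read(n)
    if len(data) != n:
        raise EOFError("truncated OBML data")
    return int.from_bytes(data, "big")

def read_blob(f):
    length = read_uint(f, 2)
    data = f.read(length)
    if len(data) != length:
        raise EOFError("truncated OBML blob")
    return data

def read_coords(f):
    return read_uint(f, 2), read_uint(f, 3)

def read_region(f, version, last_pos=(0, 0)):
    """Parse a region chunk body; returns (region_dict, new_last_pos)."""
    boxes = []
    for _ in range(read_uint(f, 1)):
        x, y = read_coords(f)
        if version >= 15:  # positions are relative and wrap around
            x = (last_pos[0] + x) & 0xFFFF
            y = (last_pos[1] + y) & 0xFFFFFF
            last_pos = (x, y)
        boxes.append(((x, y), read_coords(f)))
    if version == 12:
        link_type = read_blob(f)
        link_target = read_blob(f)
    else:  # v13 and v>=15 (assumption: v14 too)
        link_target = read_blob(f)
        read_uint(f, 2)  # unknown byte[2]
        link_type = read_blob(f)
    region = {"boxes": boxes, "target": link_target, "type": link_type}
    return region, last_pos
```

The target is left as raw bytes because it may be a url, a string, or an unknown blob; decoding is left to the caller.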

Links: 'I', 'N', 'S' chunks

Unknown region types. These follow the "region chunk" format.

Links: 'C' chunks

In v≥15, an unknown region type. In v12, unknown: byte[24]; likely to also be a region type but I haven't checked yet.

Links: 'i' chunks

Image region (link_target: url links to the original image). Note that this doesn't actually render an image; it only defines a link region for the original URL. The image itself is drawn by the content section.

Links: 'L' chunks

Link region (link_target: url is the link target). Note that this isn't directly associated with link text in any way; it merely defines the 'active' rectangle overlaid on top of the text.

URLs starting with b: seem to be JavaScript links.

Links: 'P' chunks

Link region similar to 'L' but containing a "platform" link (usually mailto:).

Links: 'w' chunks

Link region similar to 'L' but meant to trigger a file download dialog (for image "Save" buttons). The target URL is hosted by the Opera Mini proxy, and expires after some time.

Links: 'W' chunks

Link region similar to 'w' but meant to open the target in the platform's native web browser (for image "Open" buttons).

Content section

Content: 'B' chunks

Define a filled rectangle (a "box"); used to draw background colors, borders, and other lines (even link underlines).

In v≥15, contain pos: coords[rel], size: coords, fill: color.

In v≤13, contain pos: coords, size: coords, fill: color.
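
As an illustration, a 'B' chunk body could be read like this. The function name read_box is made up, the type byte is assumed to be already consumed, and the caller is assumed to carry the last position between chunks.

```python
import io

def read_uint(f, n):
    data = f.read(n)
    if len(data) != n:
        raise EOFError("truncated OBML data")
    return int.from_bytes(data, "big")

def read_box(f, version, last_pos=(0, 0)):
    """Parse a 'B' chunk body; returns (box_dict, new_last_pos)."""
    x, y = read_uint(f, 2), read_uint(f, 3)
    if version >= 15:  # relative position, wrapping at 16 and 24 bits
        x = (last_pos[0] + x) & 0xFFFF
        y = (last_pos[1] + y) & 0xFFFFFF
        last_pos = (x, y)
    size = (read_uint(f, 2), read_uint(f, 3))
    # color is stored as four bytes in A, R, G, B order
    fill = tuple(read_uint(f, 1) for _ in range(4))
    return {"pos": (x, y), "size": size, "fill": fill}, last_pos

# Usage with made-up v13 data: a 3x4 box at (1, 2).
body = b"\x00\x01\x00\x00\x02" + b"\x00\x03\x00\x00\x04" + b"\xff\x10\x20\x30"
print(read_box(io.BytesIO(body), 13))
```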

Content: 'F' chunks

Form fields.

In v≥15, contain pos: coords[rel], size: coords, foreground: color, type: byte[2], field_id: string, value: string, byte[5].

In v≤13, contain pos: coords, size: coords, foreground: color, type: byte[2], field_id: string, value: string, byte[3].

Types:

a is a multi-line input box (textarea)
c is a checkbox
r is a radio button
x is a single-line input box
s is a select drop-down

Content: 'I' chunks

Image.

In v16, contain pos: coords[rel], size: coords, fill: color, file_addr: medium, unknown: byte[11].

In v15, contain pos: coords[rel], size: coords, fill: color, unknown: byte[14].

In v≤13, contain pos: coords, size: coords, fill: color, unknown: byte[3], file_addr: medium.

fill is the image's average color, for use as placeholder when images are disabled/loading.

file_addr is the byte offset within the content section's 'S' chunk, relative to the end of data_size.

Content: 'L' chunks

Unknown. Contain byte[9] with unknown data.

Content: 'M' chunks

Unknown. Contain byte[2], blob with unknown data.

Content: 'o' chunks

It is unclear whether this is an actual chunk or just part of the preceding 'I' chunk.

Contain blob with unknown data.

Content: 'S' chunks

Embedded images.

Contain data_size: medium, followed by some number of file_data: blob. (The blob count isn't given, so keep reading blobs until you've consumed at least data_size bytes.)

Each blob contains an image (PNG or JPEG) to be drawn in all 'I'-chunks whose file_addr matches the blob's offset relative to the end of data_size.
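
The blob-reading loop can be sketched as follows. The names are illustrative. One assumption is made: a blob's "offset" is taken to be the offset of its length prefix, counted from the end of data_size; if file_addr turns out to point at the image data itself, the keys would need to be shifted by two. The PNG and JPEG signatures used for sniffing are the standard ones.

```python
import io

def read_uint(f, n):
    data = f.read(n)
    if len(data) != n:
        raise EOFError("truncated OBML data")
    return int.from_bytes(data, "big")

def read_images(f):
    """Parse a content 'S' chunk body into {offset: image_bytes}."""
    data_size = read_uint(f, 3)
    start = f.tell()  # offsets are relative to the end of data_size
    images = {}
    while f.tell() - start < data_size:
        offset = f.tell() - start  # assumption: offset of the length prefix
        length = read_uint(f, 2)
        data = f.read(length)
        if len(data) != length:
            raise EOFError("truncated image blob")
        images[offset] = data
    return images

def sniff_image_type(data):
    """Guess the image type from its leading signature bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    return "unknown"
```

An 'I' chunk's file_addr can then be looked up directly in the returned dictionary.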

Content: 'T' chunks

Text.

In v16, contain pos: coords[rel], size: coords, foreground: color, unknown: byte, font: byte, unknown_count: byte, unknown_count × (byte, blob) pairs, text: string. (It seems that the unknown pairs define some sort of links.)

In v15, contain pos: coords[rel], size: coords, foreground: color, font: byte, text: string.

In v≤13, contain pos: coords, size: coords, foreground: color, font: byte, text: string.

In font, the least-significant bit indicates bold text. With the 'bold' bit masked out, the remaining value indicates the font size:

0 – medium (approx. 11px)
2 – large (approx. 12px)
4 – extra large (approx. 13px)
6 – small (approx. 10px)

The following CSS results in an acceptable rendering:

font-family: sans-serif;
line-height: 1.1;
white-space: pre;
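
Putting the font byte and the CSS above together, a decoder could look like the sketch below. The function names are hypothetical, the pixel sizes are the approximate values listed above, and unlisted size values are reported as unknown rather than guessed.

```python
FONT_SIZES = {
    0: ("medium", 11),
    2: ("large", 12),
    4: ("extra large", 13),
    6: ("small", 10),
}

def decode_font(font):
    """Split the font byte into (bold, size_name, approx_px)."""
    bold = bool(font & 1)                # least-significant bit
    name, px = FONT_SIZES.get(font & 0xFE, ("unknown", None))
    return bold, name, px

def font_css(font):
    """Build a CSS declaration string for a text chunk's font byte."""
    bold, _name, px = decode_font(font)
    parts = ["font-family: sans-serif"]
    if px is not None:
        parts.append(f"font-size: {px}px")
    parts.append("font-weight: " + ("bold" if bold else "normal"))
    parts += ["line-height: 1.1", "white-space: pre"]
    return "; ".join(parts) + ";"

print(decode_font(3))  # (True, 'large', 12)
print(font_css(6))
```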

Content: 'z' chunks

Unknown. Rare. The only occurrence seen contains byte[6].

Miscellaneous notes

Forms

Form buttons have no special representation; they just consist of an image, text, and a link region using special b:… URLs.

Input fields and select dropdowns haven't been fully researched yet.
