vstakhov/mime_parser.md

Created December 15, 2016 14:44

Star () You must be signed in to star a gist
Fork () You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/vstakhov/937f253d5935ee4158688932589b1dcc.js"></script>
Save vstakhov/937f253d5935ee4158688932589b1dcc to your computer and use it in GitHub Desktop.

Download ZIP

Raw

mime_parser.md

New mime parser algorithm for Rspamd

If we have any complicated (multipart or message) part in the top, then apply the following algorithm. Otherwise goto step 10
Use multipattern scan (hyperscan/aho-corasic) to find all occurences of patterns \r-- and \n--
For each occurence we check sanity:

We find the length of substring where there are no \r and \n
We check the ending of this substring
If this substring is empty, we skip it
If this substring ends with -- we assume it closing

For normal boundaries calculate H(boundary), where H is a hash function described below
For closing boundaries calculate H(boundary) AND H(closed_boundary)
When all boundaries are found, perform parsing of message using array of the following structures:

Boundary offset
Bondary hash (+ closed hash)
End of boundary (potential part start)
Flags (closing boundary)

For each multipart we apply filter function which selects the boundaries that are related to this multipart
filter function uses hashes to select the required boundaries and ensures that all offsets are enclosed within multipart range
Apply this algorithm for all multiparts (recursing to the nested subparts if needed)
For other parts just parse them normally (using CTE detection + heuristic)

Algorithm advantages

Scans text only once when searching for boundaries
We do not compare boundaries which might be long strings but use hashes
It is easy to check overflows and corner cases
It is easy to work with broken mime structures (as no bactracking is needed)
Algorithm is very simple (less than 1k LoC)

Algorithm disadvantages

Cannot work in streaming mode (BAD)
Hash collisions (but see below)

Hash function concerns

We apply siphash with randomly generated key. This key is regenerated after 10k messages are parsed. Hence, the collision probability (both malicious and non-malicious) is negligible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment