- If we have any complicated (multipart or message) part in the top, then apply the following algorithm. Otherwise goto step 10
- Use multipattern scan (hyperscan/aho-corasic) to find all occurences of patterns
\r--
and\n--
- For each occurence we check sanity:
- We find the length of substring where there are no
\r
and\n
- We check the ending of this substring
- If this substring is empty, we skip it
- If this substring ends with
--
we assume it closing
- For normal boundaries calculate
H(boundary)
, whereH
is a hash function described below - For closing boundaries calculate
H(boundary)
ANDH(closed_boundary)
- When all boundaries are found, perform parsing of message using array of the following structures:
- Boundary offset
- Bondary hash (+ closed hash)
- End of boundary (potential part start)
- Flags (closing boundary)
- For each multipart we apply
filter
function which selects the boundaries that are related to this multipart filter
function uses hashes to select the required boundaries and ensures that all offsets are enclosed within multipart range- Apply this algorithm for all multiparts (recursing to the nested subparts if needed)
- For other parts just parse them normally (using CTE detection + heuristic)
- Scans text only once when searching for boundaries
- We do not compare boundaries which might be long strings but use hashes
- It is easy to check overflows and corner cases
- It is easy to work with broken mime structures (as no bactracking is needed)
- Algorithm is very simple (less than 1k LoC)
- Cannot work in streaming mode (BAD)
- Hash collisions (but see below)
We apply siphash with randomly generated key. This key is regenerated after 10k messages are parsed. Hence, the collision probability (both malicious and non-malicious) is negligible.