Skip to content

Instantly share code, notes, and snippets.

@flip111
Created March 7, 2019 17:24
Show Gist options
  • Save flip111/775dedf39c46f84325ce58c2c6475ea8 to your computer and use it in GitHub Desktop.
Save flip111/775dedf39c46f84325ce58c2c6475ea8 to your computer and use it in GitHub Desktop.
Longest common substring PHP algorithm benchmark (see comments for results)
<?php
// reset && /usr/bin/time php lcs.php
$utf8test = true; // utf8 or ascii test input
$decimals = 2;
// test iterations
// amount is chosen so that each test takes about 10 minutes on my PC (AMD Ryzen 2400G)
// tune: round((10×60) × (iterations / runtime)) for 10 minutes
if ($utf8test) {
printf("## UTF-8 test input ##\n");
$iterations = 61;
} else {
printf("## ASCII test input ##\n");
$iterations = 814;
}
printf("Iterations: %d\n\n", $iterations);
// https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring#PHP
function get_longest_common_subsequence($string_1, $string_2)
{
$string_1_length = strlen($string_1);
$string_2_length = strlen($string_2);
$return = '';
if ($string_1_length === 0 || $string_2_length === 0)
{
// No similarities
return $return;
}
$longest_common_subsequence = array();
// Initialize the CSL array to assume there are no similarities
$longest_common_subsequence = array_fill(0, $string_1_length, array_fill(0, $string_2_length, 0));
$largest_size = 0;
for ($i = 0; $i < $string_1_length; $i++)
{
for ($j = 0; $j < $string_2_length; $j++)
{
// Check every combination of characters
if ($string_1[$i] === $string_2[$j])
{
// These are the same in both strings
if ($i === 0 || $j === 0)
{
// It's the first character, so it's clearly only 1 character long
$longest_common_subsequence[$i][$j] = 1;
}
else
{
// It's one character longer than the string from the previous character
$longest_common_subsequence[$i][$j] = $longest_common_subsequence[$i - 1][$j - 1] + 1;
}
if ($longest_common_subsequence[$i][$j] > $largest_size)
{
// Remember this as the largest
$largest_size = $longest_common_subsequence[$i][$j];
// Wipe any previous results
$return = '';
// And then fall through to remember this new value
}
if ($longest_common_subsequence[$i][$j] === $largest_size)
{
// Remember the largest string(s)
$return = substr($string_1, $i - $largest_size + 1, $largest_size);
}
}
// Else, $CSL should be set to 0, which it was already initialized to
}
}
// Return the list of matches
return $return;
}
// https://stackoverflow.com/a/5109221/1833322
function longest_common_substring($words) {
$words = array_map('strtolower', array_map('trim', $words));
$sort_by_strlen = create_function('$a, $b', 'if (strlen($a) == strlen($b)) { return strcmp($a, $b); } return (strlen($a) < strlen($b)) ? -1 : 1;');
usort($words, $sort_by_strlen);
// We have to assume that each string has something in common with the first
// string (post sort), we just need to figure out what the longest common
// string is. If any string DOES NOT have something in common with the first
// string, return false.
$longest_common_substring = array();
$shortest_string = str_split(array_shift($words));
while (sizeof($shortest_string)) {
array_unshift($longest_common_substring, '');
foreach ($shortest_string as $ci => $char) {
foreach ($words as $wi => $word) {
if (!strstr($word, $longest_common_substring[0] . $char)) {
// No match
break 2;
} // if
} // foreach
// we found the current char in each word, so add it to the first longest_common_substring element,
// then start checking again using the next char as well
$longest_common_substring[0].= $char;
} // foreach
// We've finished looping through the entire shortest_string.
// Remove the first char and start all over. Do this until there are no more
// chars to search on.
array_shift($shortest_string);
}
// If we made it here then we've run through everything
usort($longest_common_substring, $sort_by_strlen);
return array_pop($longest_common_substring);
}
// https://stackoverflow.com/a/36451409/1833322
function getLongestMatchingSubstring($str1, $str2)
{
$len_1 = strlen($str1);
$longest = '';
for($i = 0; $i < $len_1; $i++){
for($j = $len_1 - $i; $j > 0; $j--){
$sub = substr($str1, $i, $j);
if (strpos($str2, $sub) !== false && strlen($sub) > strlen($longest)){
$longest = $sub;
break;
}
}
}
return $longest;
}
// https://gist.github.com/relaxnow/1276502
function getLongestCommonSubstring($first, $second)
{
$longestCommonSubstringIndexInFirst = 0;
$table = array();
$largestFound = 0;
$firstLength = strlen($first);
$secondLength = strlen($second);
for ($i = 0; $i < $firstLength; $i++) {
for ($j = 0; $j < $secondLength; $j++) {
if ($first[$i] === $second[$j]) {
if (!isset($table[$i])) {
$table[$i] = array();
}
if ($i > 0 && $j > 0 && isset($table[$i-1][$j-1])) {
$table[$i][$j] = $table[$i-1][$j-1] + 1;
}
else {
$table[$i][$j] = 1;
}
if ($table[$i][$j] > $largestFound) {
$largestFound = $table[$i][$j];
$longestCommonSubstringIndexInFirst = $i - $largestFound + 1;
}
}
}
}
if ($largestFound === 0) {
return "";
}
else {
return substr($first, $longestCommonSubstringIndexInFirst, $largestFound);
}
}
// https://gist.github.com/relaxnow/1276502#gistcomment-2762941
function lcs2($first, $second)
{
$len1 = strlen($first);
$len2 = strlen($second);
if ($len1 < $len2) {
$shortest = $first;
$longest = $second;
$len_shortest = $len1;
} else {
$shortest = $second;
$longest = $first;
$len_shortest = $len2;
}
//check max len
$pos = strpos($longest, $shortest);
if($pos !== false) return $shortest;
for ($i = 1, $j = $len_shortest - 1; $j > 0; --$j, ++$i) {
for($k = 0; $k <= $i; ++$k){
$substr = substr($shortest, $k, $j);
$pos = strpos($longest, $substr);
if($pos !== false) return $substr;
}
}
return "";
}
// https://gist.github.com/relaxnow/1276502#gistcomment-2779505
function lcs2mb($first, $second)
{
$len1 = strlen($first);
$len2 = strlen($second);
if ($len1 < $len2) {
$shortest = $first;
$longest = $second;
$len_shortest = $len1;
} else {
$shortest = $second;
$longest = $first;
$len_shortest = $len2;
}
//check max len
$pos = strpos($longest, $shortest);
if($pos !== false) return $shortest;
for ($i = 1, $j = $len_shortest - 1; $j > 0; --$j, ++$i) {
for($k = 0; $k <= $i; ++$k){
$substr = substr($shortest, $k, $j);
$pos = strpos($longest, $substr);
//if found then check last char if it is > 128 then trim it
if($pos !== false) return ord($substr[$j-1]) > 128 ? substr($substr, 0, $j - 1) : $substr;
}
}
return "";
}
function strlcs(string $first, string $second) : string {
if ($first === '' || $second === '') {
return '';
}
$len1 = strlen($first);
$len2 = strlen($second);
if ($len1 < $len2) {
$shortest = $first;
$longest = $second;
$len_shortest = $len1;
} else {
$shortest = $second;
$longest = $first;
$len_shortest = $len2;
}
$pos = strpos($longest, $shortest);
if ($pos !== false) {
return $shortest;
}
for ($i = 1, $j = $len_shortest - 1; $j > 0; --$j, ++$i) {
for ($k = 0; $k <= $i; ++$k) {
$substr = substr($shortest, $k, $j);
$pos = strpos($longest, $substr);
if ($pos !== false) {
return $substr;
}
}
}
return '';
}
// https://stackoverflow.com/a/47825459/1833322
/**
* Determine the longest common match within two strings
*
* @param string $str1
* @param string $str2 Two strings in any order.
* @param boolean $case_sensitive Set to true to force
* case sensitivity. Default: false (case insensitive).
* @return string The longest string - first match.
*/
// The following two functions give wrong output when one of the strings is empty.
function get_longest_common_subsequence2( $str1, $str2, $case_sensitive = false ) {
// We'll use '#' as our regex delimiter. Any character can be used as we'll quote the string anyway,
$delimiter = '#';
// We'll find the shortest string and use that to create our regex.
$l1 = strlen( $str1 );
$l2 = strlen( $str2 );
$str = $l1 <= $l2 ? $str1 : $str2;
$l = min( $l1, $l2 );
// Regex for each character will be of the format (?:a(?=b))?
// We also need to capture the last character, but this prevents us from matching strings with a single character. (?:.|c)?
$reg = $delimiter;
for ( $i = 0; $i < $l; $i++ ) {
$a = preg_quote( $str[ $i ], $delimiter );
$b = $i + 1 < $l ? preg_quote( $str[ $i + 1 ], $delimiter ) : false;
$reg .= sprintf( $b !== false ? '(?:%s(?=%s))?' : '(?:.|%s)?', $a, $b );
}
$reg .= $delimiter;
if ( ! $case_sensitive ) {
$reg .= 'i';
}
// Resulting example regex from a string 'abbc':
// '#(?:a(?=b))?(?:b(?=b))?(?:b(?=c))?(?:.|c)?#i';
// Perform our regex on the remaining string
$str = $l1 <= $l2 ? $str2 : $str1;
if ( preg_match_all( $reg, $str, $matches ) ) {
// $matches is an array with a single array with all the matches.
return array_reduce( $matches[0], function( $a, $b ) {
$al = strlen( $a );
$bl = strlen( $b );
// Return the longest string, as long as it's not a single character.
return $al >= $bl || $bl <= 1 ? $a : $b;
}, '' );
}
// No match - Return an empty string.
return '';
}
// http://benwendt.ca/articles/longest-common-substring-in-php/
function longest_common_substring2($string1, $string2) {
$L = array();
$length = 0;
$pos = 0;
$array1 =str_split($string1);
$array2 =str_split($string2);
foreach ($array1 as $i => $c1) {
$L[$i] = array();
foreach ($array2 as $j => $c2) {
$L[$i][$j] = 0;
if ($c1 == $c2) {
if ($i == 0 || $j == 0) {
// initialize that this character position exists.
$L[$i][$j] = 1;
} else {
// increment previous or reset.
if (isset($L[$i-1][$j-1])) {
$L[$i][$j] = $L[$i-1][$j-1] + 1;
} else {
$L[$i][$j] = 0;
}
}
if ($L[$i][$j] > $length) {
$length = $L[$i][$j];
}
if ((isset($L[$i][$j]))&&($L[$i][$j] == $length)) {
$pos = $i;
}
}
}
}
if ($length > 0) {
return substr($string1, $pos - $length + 1, $length);
} else {
return '';
}
}
// https://www.programmingalgorithms.com/algorithm/longest-common-substring?lang=PHP
function LongestCommonSubstring($str1, $str2, &$subStr)
{
$subStr = "";
if ($str1 == "" || $str2 == "")
return 0;
$num = array();
$maxlen = 0;
$lastSubsBegin = 0;
$subStrBuilder = "";
$str1Len = strlen($str1);
$str2Len = strlen($str2);
for ($i = 0; $i < $str1Len; $i++)
{
for ($j = 0; $j < $str2Len; $j++)
{
if ($str1[$i] != $str2[$j])
{
$num[$i][$j] = 0;
}
else
{
if (($i == 0) || ($j == 0))
$num[$i][$j] = 1;
else
$num[$i][$j] = 1 + $num[$i - 1][$j - 1];
if ($num[$i][$j] > $maxlen)
{
$maxlen = $num[$i][$j];
$thisSubsBegin = $i - $num[$i][$j] + 1;
if ($lastSubsBegin == $thisSubsBegin)
{
$subStrBuilder .= $str1[$i];
}
else
{
$lastSubsBegin = $thisSubsBegin;
$subStrBuilder = "";
$subStrBuilder .= substr($str1, $lastSubsBegin, ($i + 1) - $lastSubsBegin);
}
}
}
}
}
$subStr = $subStrBuilder;
return $maxlen;
}
// 4 byte noise
$smallNoise = 'AV6S';
// 128 byte noise
$bigNoise = <<<'EOD'
trmzoLWSukSoYmfAhO0cTlzcC61F2Y1OMDYpmKucLkzSEohePgublsqnZhGpkIZbuq8P37kJ3QZAcTITNbS2SK17blivlfgsKSRagmWuFtQEvTnt8Rev5Xps7XxvxLrg
EOD;
$lorem = <<<'EOD'
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce id volutpat ligula. Suspendisse molestie pellentesque purus, eu dictum magna imperdiet ut. Morbi sit amet orci massa. Proin sed convallis massa. Integer sodales nisl vestibulum, tempus metus ut, semper lacus. Quisque viverra luctus urna in venenatis. Pellentesque volutpat lacus sit amet ex malesuada volutpat. Pellentesque elementum at tellus eu molestie. Praesent vel aliquet dui.
Nullam semper magna arcu, at tincidunt nisi malesuada vel. Curabitur tincidunt felis quis ligula aliquam, nec lacinia ligula ornare. Curabitur tincidunt lorem vitae eros consequat faucibus. Donec massa enim, efficitur non lacinia ut, viverra eu mauris. Vestibulum sollicitudin massa est, sit amet scelerisque arcu dapibus quis. Pellentesque dapibus in ante eget euismod. Aliquam faucibus arcu nec ante hendrerit sagittis. Praesent aliquet, leo sed pretium tempus, mauris ante malesuada risus, non fringilla dui eros a velit. Aliquam malesuada nisl sed est porta, vitae lobortis dolor vehicula. Vestibulum id consectetur mi. Proin at mi ac dolor gravida viverra. In quis dui sollicitudin augue pharetra iaculis. In hac habitasse platea dictumst. Etiam eu faucibus est, in auctor elit.
Phasellus tincidunt quam velit, nec lacinia orci mattis sed. Mauris ut orci sit amet neque suscipit volutpat in id neque. Quisque sollicitudin, mauris a vehicula efficitur, ipsum enim auctor urna, id tincidunt tortor sapien eget dolor. Ut elementum dui sed porta tristique. Proin gravida ornare mi non luctus. Vestibulum faucibus lectus at nulla varius, vitae congue nunc pretium. Vivamus tortor lectus, auctor at augue molestie, porta scelerisque ligula.
EOD;
$loremStart = 'Lorem ipsum';
$loremMiddle = 'Pellentesque dapibus';
$loremEnd = 'scelerisque ligula';
//http://kermitproject.org/utf8.html
$iCanEatGlass = <<<'EOD'
काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥
kācaṃ śaknomyattum; nopahinasti mām.
ὕαλον ϕαγεῖν δύναμαι· τοῦτο οὔ με βλάπτει.
Μπορώ να φάω σπασμένα γυαλιά χωρίς να πάθω τίποτα.
Μπορῶ νὰ φάω σπασμένα γυαλιὰ χωρὶς νὰ πάθω τίποτα.
Vitrum edere possum; mihi non nocet.
Je puis mangier del voirre. Ne me nuit.
Je peux manger du verre, ça ne me fait pas mal.
Pòdi manjar de veire, me nafrariá pas.
J'peux manger d'la vitre, ça m'fa pas mal.
Dji pou magnî do vêre, çoula m' freut nén må.
Ch'peux mingi du verre, cha m'foé mie n'ma.
Mwen kap manje vè, li pa blese'm.
Kristala jan dezaket, ez dit minik ematen.
Puc menjar vidre, que no em fa mal.
Puedo comer vidrio, no me hace daño.
Puedo minchar beire, no me'n fa mal .
Eu podo xantar cristais e non cortarme.
Posso comer vidro, não me faz mal.
Posso comer vidro, não me machuca.
M' podê cumê vidru, ca ta maguâ-m'.
Ami por kome glas anto e no ta hasimi daño.
Posso mangiare il vetro e non mi fa male.
Sôn bôn de magnà el véder, el me fa minga mal.
Me posso magna' er vetro, e nun me fa male.
M' pozz magna' o'vetr, e nun m' fa mal.
Mi posso magnare el vetro, no'l me fa mae.
Pòsso mangiâ o veddro e o no me fà mâ.
Puotsu mangiari u vitru, nun mi fa mali.
Jau sai mangiar vaider, senza che quai fa donn a mai.
Pot să mănânc sticlă și ea nu mă rănește.
Mi povas manĝi vitron, ĝi ne damaĝas min.
Mý a yl dybry gwéder hag éf ny wra ow ankenya.
Dw i'n gallu bwyta gwydr, 'dyw e ddim yn gwneud dolur i mi.
Foddym gee glonney agh cha jean eh gortaghey mee.
᚛᚛ᚉᚑᚅᚔᚉᚉᚔᚋ ᚔᚈᚔ ᚍᚂᚐᚅᚑ ᚅᚔᚋᚌᚓᚅᚐ᚜
Con·iccim ithi nglano. Ním·géna.
Is féidir liom gloinne a ithe. Ní dhéanann sí dochar ar bith dom.
Ithim-sa gloine agus ní miste damh é.
S urrainn dhomh gloinne ithe; cha ghoirtich i mi.
ᛁᚳ᛫ᛗᚨᚷ᛫ᚷᛚᚨᛋ᛫ᛖᚩᛏᚪᚾ᛫ᚩᚾᛞ᛫ᚻᛁᛏ᛫ᚾᛖ᛫ᚻᛖᚪᚱᛗᛁᚪᚧ᛫ᛗᛖ᛬
Ic mæg glæs eotan ond hit ne hearmiað me.
Ich canne glas eten and hit hirtiþ me nouȝt.
I can eat glass and it doesn't hurt me.
[aɪ kæn iːt glɑːs ænd ɪt dɐz nɒt hɜːt miː]
⠊⠀⠉⠁⠝⠀⠑⠁⠞⠀⠛⠇⠁⠎⠎⠀⠁⠝⠙⠀⠊⠞⠀⠙⠕⠑⠎⠝⠞⠀⠓⠥⠗⠞⠀⠍⠑
Mi kian niam glas han i neba hot mi.
Ah can eat gless, it disnae hurt us.
𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸.
ᛖᚴ ᚷᛖᛏ ᛖᛏᛁ ᚧ ᚷᛚᛖᚱ ᛘᚾ ᚦᛖᛋᛋ ᚨᚧ ᚡᛖ ᚱᚧᚨ ᛋᚨᚱ
Ek get etið gler án þess að verða sár.
Eg kan eta glas utan å skada meg.
Jeg kan spise glass uten å skade meg.
Eg kann eta glas, skaðaleysur.
Ég get etið gler án þess að meiða mig.
Jag kan äta glas utan att skada mig.
Jeg kan spise glas, det gør ikke ondt på mig.
Æ ka æe glass uhen at det go mæ naue.
Ik kin glês ite, it docht me net sear.
Ik kan glas eten, het doet mij geen kwaad.
Iech ken glaas èèse, mer 't deet miech jing pieng.
Ek kan glas eet, maar dit doen my nie skade nie.
Ech kan Glas iessen, daat deet mir nët wei.
Ich kann Glas essen, ohne mir zu schaden.
Ich kann Glas verkasematuckeln, ohne dattet mich wat jucken tut.
Isch kann Jlaas kimmeln, uuhne datt mich datt weh dääd.
Ich koann Gloos assn und doas dudd merr ni wii.
Iech konn glaasch voschbachteln ohne dass es mir ebbs daun doun dud.
'sch kann Glos essn, ohne dass'sch mer wehtue.
Isch konn Glass fresse ohne dasses mer ebbes ausmache dud.
I kå Glas frässa, ond des macht mr nix!
I ka glas eassa, ohne dass mar weh tuat.
I koh Glos esa, und es duard ma ned wei.
I kaun Gloos essen, es tuat ma ned weh.
Ich chan Glaas ässe, das schadt mir nöd.
Ech cha Glâs ässe, das schadt mer ned.
Meg tudom enni az üveget, nem lesz tőle bajom.
Voin syödä lasia, se ei vahingoita minua.
Sáhtán borrat lása, dat ii leat bávččas.
Мон ярсан суликадо, ды зыян эйстэнзэ а ули.
Mie voin syvvä lasie ta minla ei ole kipie.
Minä voin syvvä st'oklua dai minule ei ole kibie.
Ma võin klaasi süüa, see ei tee mulle midagi.
Es varu ēst stiklu, tas man nekaitē.
Aš galiu valgyti stiklą ir jis manęs nežeidžia
Mohu jíst sklo, neublíží mi.
Môžem jesť sklo. Nezraní ma.
Mogę jeść szkło i mi nie szkodzi.
Lahko jem steklo, ne da bi mi škodovalo.
Ja mogu jesti staklo, i to mi ne šteti.
Ја могу јести стакло, и то ми не штети.
Можам да јадам стакло, а не ме штета.
Я могу есть стекло, оно мне не вредит.
Я магу есці шкло, яно мне не шкодзіць.
Ja mahu jeści škło, jano mne ne škodzić.
Я можу їсти скло, і воно мені не зашкодить.
Мога да ям стъкло, то не ми вреди.
მინას ვჭამ და არა მტკივა.
Կրնամ ապակի ուտել և ինծի անհանգիստ չըներ։
Unë mund të ha qelq dhe nuk më gjen gjë.
Cam yiyebilirim, bana zararı dokunmaz.
جام ييه بلورم بڭا ضررى طوقونمز
Алам да бар, пыяла, әмма бу ранит мине.
Men shisha yeyishim mumkin, ammo u menga zarar keltirmaydi.
Мен шиша ейишим мумкин, аммо у менга зарар келтирмайди.
আমি কাঁচ খেতে পারি, তাতে আমার কোনো ক্ষতি হয় না।
मी काच खाऊ शकतो, मला ते दुखत नाही.
ನನಗೆ ಹಾನಿ ಆಗದೆ, ನಾನು ಗಜನ್ನು ತಿನಬಹುದು
मैं काँच खा सकता हूँ और मुझे उससे कोई चोट नहीं पहुंचती.
എനിക്ക് ഗ്ലാസ് തിന്നാം. അതെന്നെ വേദനിപ്പിക്കില്ല.
நான் கண்ணாடி சாப்பிடுவேன், அதனால் எனக்கு ஒரு கேடும் வராது.
నేను గాజు తినగలను మరియు అలా చేసినా నాకు ఏమి ఇబ్బంది లేదు
මට වීදුරු කෑමට හැකියි. එයින් මට කිසි හානියක් සිදු නොවේ.
میں کانچ کھا سکتا ہوں اور مجھے تکلیف نہیں ہوتی ۔
زه شيشه خوړلې شم، هغه ما نه خوږوي
.من می توانم بدونِ احساس درد شيشه بخورم
أنا قادر على أكل الزجاج و هذا لا يؤلمني.
Nista' niekol il-ħġieġ u ma jagħmilli xejn.
אני יכול לאכול זכוכית וזה לא מזיק לי.
איך קען עסן גלאָז און עס טוט מיר נישט װײ.
Metumi awe tumpan, ɜnyɜ me hwee.
Inā iya taunar gilāshi kuma in gamā lāfiyā.
إِنا إِىَ تَونَر غِلَاشِ كُمَ إِن غَمَا لَافِىَا
Mo lè je̩ dígí, kò ní pa mí lára.
Nakokí kolíya biténi bya milungi, ekosála ngáí mabé tɛ́.
Naweza kula bilauri na sikunyui.
Saya boleh makan kaca dan ia tidak mencederakan saya.
Kaya kong kumain nang bubog at hindi ako masaktan.
Siña yo' chumocho krestat, ti ha na'lalamen yo'.
Au rawa ni kana iloilo, ia au sega ni vakacacani kina.
Aku isa mangan beling tanpa lara.
က္ယ္ဝန္‌တော္‌၊က္ယ္ဝန္‌မ မ္ယက္‌စားနုိင္‌သည္‌။ ၎က္ရောင္‌့ ထိခုိက္‌မ္ဟု မရ္ဟိပာ။ (9)
ကျွန်တော် ကျွန်မ မှန်စားနိုင်တယ်။ ၎င်းကြောင့် ထိခိုက်မှုမရှိပါ။ (9)
Tôi có thể ăn thủy tinh mà không hại gì.
些 𣎏 世 咹 水 晶 𦓡 空 𣎏 害 咦
ខ្ញុំអាចញុំកញ្ចក់បាន ដោយគ្មានបញ្ហារ
ຂອ້ຍກິນແກ້ວໄດ້ໂດຍທີ່ມັນບໍ່ໄດ້ເຮັດໃຫ້ຂອ້ຍເຈັບ.
ฉันกินกระจกได้ แต่มันไม่ทำให้ฉันเจ็บ
Би шил идэй чадна, надад хортой биш
ᠪᠢ ᠰᠢᠯᠢ ᠢᠳᠡᠶᠦ ᠴᠢᠳᠠᠨᠠ ᠂ ᠨᠠᠳᠤᠷ ᠬᠣᠤᠷᠠᠳᠠᠢ ᠪᠢᠰᠢ
म काँच खान सक्छू र मलाई केहि नी हुन्‍न् ।
ཤེལ་སྒོ་ཟ་ནས་ང་ན་གི་མ་རེད།
我能吞下玻璃而不伤身体。
我能吞下玻璃而不傷身體。
Góa ē-tàng chia̍h po-lê, mā bē tio̍h-siong.
私はガラスを食べられます。それは私を傷つけません。
나는 유리를 먹을 수 있어요. 그래도 아프지 않아요
Mi save kakae glas, hemi no save katem mi.
Hiki iaʻu ke ʻai i ke aniani; ʻaʻole nō lā au e ʻeha.
E koʻana e kai i te karahi, mea ʻā, ʻaʻe hauhau.
ᐊᓕᒍᖅ ᓂᕆᔭᕌᖓᒃᑯ ᓱᕋᙱᑦᑐᓐᓇᖅᑐᖓ
Naika məkmək kakshət labutay, pi weyk ukuk munk-sik nay.
Tsésǫʼ yishą́ągo bííníshghah dóó doo shił neezgai da.
mi kakne le nu citka le blaci .iku'i le se go'i na xrani mi
Ljœr ye caudran créneþ ý jor cẃran.
EOD;
$unicodeSmallNoise = '्သ프ל';
function mb_str_split(string $string) {
$codepoints = [];
while (mb_strlen($string) > 0) {
$codepoints[] = mb_substr($string, 0, 1, 'UTF-8');
$string = mb_substr($string, 1, null, 'UTF-8');
}
return $codepoints;
}
function mb_str_shuffle(string $string) {
$codepoints = mb_str_split($string);
shuffle($codepoints);
return implode('', $codepoints);
}
// This string was made by copying pieces from $iCanEatGlass and then shuffling
$unicodeBigNoise = <<<'EOD'
بກ身体íа能čმরसᕋగوh晶ാủtດ་с་سనᛚTիᚷ่世اзตວйnིிåကᚈَǫىᙱsiéhُэys ⠇ាهိངvტzাکهໄkᓱགలြíرíᚾु吞مžłᚩbíมčमنáhكiనնʼแí⠛шغoकsᚔdপت̍ɐ害ញடدsו້ന्ငכចכ
EOD;
$glassStart = 'śaknomyattum';
$glassMiddle = 'ebbes ausmache';
$glassEnd = 'cẃran';
// Use non-ascii
if ($utf8test) {
$smallNoise = $unicodeSmallNoise;
$bigNoise = $unicodeBigNoise;
$lorem = $iCanEatGlass;
$loremStart = $glassStart;
$loremMiddle = $glassMiddle;
$loremEnd = $glassEnd;
}
$stringPairs =
[ ['a', 'a', 'a'] // 0
, ['a', 'b', ''] // 1
, ['b', 'a', ''] // 2
, ['', 'a', ''] // 3
, ['a', '', ''] // 4
// exact
, [ $lorem, $loremStart, $loremStart] // start 5
, [ $lorem, $loremMiddle, $loremMiddle] // end 6
, [ $lorem, $loremEnd, $loremEnd] // middle 7
// prefix noise SHORT
, [ $lorem, $smallNoise . $loremStart, $loremStart] // start 8
, [ $lorem, $smallNoise . $loremMiddle, $loremMiddle] // end 9
, [ $lorem, $smallNoise . $loremEnd, $loremEnd] // middle 10
// postfix noise SHORT
, [ $lorem, $loremStart . $smallNoise, $loremStart] // start 11
, [ $lorem, $loremMiddle . $smallNoise, $loremMiddle] // end 12
, [ $lorem, $loremEnd . $smallNoise, $loremEnd] // middle 13
// prefix + postfix noise SHORT
, [ $lorem, $smallNoise . $loremStart . $smallNoise, $loremStart] // start 14
, [ $lorem, $smallNoise . $loremMiddle . $smallNoise, $loremMiddle] // end 15
, [ $lorem, $smallNoise . $loremEnd . $smallNoise, $loremEnd] // middle 16
// prefix noise LONG
, [ $lorem, $bigNoise . $loremStart, $loremStart] // start 17
, [ $lorem, $bigNoise . $loremMiddle, $loremMiddle] // end 18
, [ $lorem, $bigNoise . $loremEnd, $loremEnd] // middle 19
// postfix noise LONG
, [ $lorem, $loremStart . $bigNoise, $loremStart] // start 20
, [ $lorem, $loremMiddle . $bigNoise, $loremMiddle] // end 21
, [ $lorem, $loremEnd . $bigNoise, $loremEnd] // middle 22
// prefix + postfix noise LONG
, [ $lorem, $bigNoise . $loremStart . $bigNoise, $loremStart] // start 23
, [ $lorem, $bigNoise . $loremMiddle . $bigNoise, $loremMiddle] // end 24
, [ $lorem, $bigNoise . $loremEnd . $bigNoise, $loremEnd] // middle 25
];
$testNr = 1;
$results = [];
// This one modifies output to lowercase !!
// foreach ($stringPairs as $k => [$s1, $s2, $expectation]) {
// if ($s1 === '' || $s2 === '') {
// $results[$k][$testNr] = null;
// continue;
// }
// $start = microtime(true);
// for ($i = 0; $i <= $iterations; $i++) {
// $output = longest_common_substring([$s1, $s2]);
// }
// $results[$k][$testNr] = microtime(true) - $start;
// if ($output !== $expectation) printf("Expected \"%s\" in pair %d, but got \"%s\"\n", $expectation, $k, $output);
// }
// printf("Finished test %d\n", $testNr);
// $testNr++;
// This one is so slow it's not even worth benchmarking
// foreach ($stringPairs as $k => [$s1, $s2, $expectation]) {
// $start = microtime(true);
// for ($i = 0; $i <= $iterations; $i++) {
// getLongestMatchingSubstring($s1, $s2);
// }
// $results[$k][$testNr] = microtime(true) - $start;
// }
// printf("Finished test %d\n", $testNr);
// $testNr++;
// Found to be problematic
// foreach ($stringPairs as $k => [$s1, $s2, $expectation]) {
// $start = microtime(true);
// for ($i = 0; $i <= $iterations; $i++) {
// $output = get_longest_common_subsequence2($s1, $s2, true);
// }
// $results[$k][$testNr] = microtime(true) - $start;
// if ($output !== $expectation) printf("Expected \"%s\" in pair %d, but got \"%s\"\n", $expectation, $k, $output);
// }
// printf("Finished test %d\n", $testNr);
// $testNr++;
// Found to be problematic
// foreach ($stringPairs as $k => [$s1, $s2, $expectation]) {
// $start = microtime(true);
// for ($i = 0; $i <= $iterations; $i++) {
// $output = get_longest_common_subsequence2($s1, $s2, false);
// }
// $results[$k][$testNr] = microtime(true) - $start;
// if ($output !== $expectation) printf("Expected \"%s\" in pair %d, but got \"%s\"\n", $expectation, $k, $output);
// }
// printf("Finished test %d\n", $testNr);
// $testNr++;
// https://www.programmingalgorithms.com/algorithm/longest-common-substring?lang=PHP
foreach ($stringPairs as $k => [$s1, $s2, $expectation]) {
$start = microtime(true);
$output = '';
for ($i = 0; $i <= $iterations; $i++) {
LongestCommonSubstring($s1, $s2, $output);
}
$results[$k][$testNr] = microtime(true) - $start;
if ($output !== $expectation) printf("Expected \"%s\" in pair %d, but got \"%s\"\n", $expectation, $k, $output);
}
printf("Finished test %d\n", $testNr);
$testNr++;
// http://benwendt.ca/articles/longest-common-substring-in-php/
foreach ($stringPairs as $k => [$s1, $s2, $expectation]) {
$start = microtime(true);
for ($i = 0; $i <= $iterations; $i++) {
$output = longest_common_substring2($s1, $s2);
}
$results[$k][$testNr] = microtime(true) - $start;
if ($output !== $expectation) printf("Expected \"%s\" in pair %d, but got \"%s\"\n", $expectation, $k, $output);
}
printf("Finished test %d\n", $testNr);
$testNr++;
// https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring#PHP
foreach ($stringPairs as $k => [$s1, $s2, $expectation]) {
$start = microtime(true);
for ($i = 0; $i <= $iterations; $i++) {
$output = get_longest_common_subsequence($s1, $s2);
}
$results[$k][$testNr] = microtime(true) - $start;
if ($output !== $expectation) printf("Expected \"%s\" in pair %d, but got \"%s\"\n", $expectation, $k, $output);
}
printf("Finished test %d\n", $testNr);
$testNr++;
// https://gist.github.com/relaxnow/1276502
foreach ($stringPairs as $k => [$s1, $s2, $expectation]) {
$start = microtime(true);
for ($i = 0; $i <= $iterations; $i++) {
$output = getLongestCommonSubstring($s1, $s2);
}
$results[$k][$testNr] = microtime(true) - $start;
if ($output !== $expectation) printf("Expected \"%s\" in pair %d, but got \"%s\"\n", $expectation, $k, $output);
}
printf("Finished test %d\n", $testNr);
$testNr++;
// https://gist.github.com/relaxnow/1276502#gistcomment-2762941
foreach ($stringPairs as $k => [$s1, $s2, $expectation]) {
if ($s1 === '' || $s2 === '') {
$results[$k][$testNr] = null;
continue;
}
$start = microtime(true);
for ($i = 0; $i <= $iterations; $i++) {
$output = lcs2($s1, $s2);
}
$results[$k][$testNr] = microtime(true) - $start;
if ($output !== $expectation) printf("Expected \"%s\" in pair %d, but got \"%s\"\n", $expectation, $k, $output);
}
printf("Finished test %d\n", $testNr);
$testNr++;
// https://gist.github.com/relaxnow/1276502#gistcomment-2779505
foreach ($stringPairs as $k => [$s1, $s2, $expectation]) {
if ($s1 === '' || $s2 === '') {
$results[$k][$testNr] = null;
continue;
}
$start = microtime(true);
for ($i = 0; $i <= $iterations; $i++) {
$output = lcs2mb($s1, $s2);
}
$results[$k][$testNr] = microtime(true) - $start;
if ($output !== $expectation) printf("Expected \"%s\" in pair %d, but got \"%s\"\n", $expectation, $k, $output);
}
printf("Finished test %d\n", $testNr);
$testNr++;
// This souce code
foreach ($stringPairs as $k => [$s1, $s2, $expectation]) {
$start = microtime(true);
for ($i = 0; $i <= $iterations; $i++) {
$output = strlcs($s1, $s2);
}
$results[$k][$testNr] = microtime(true) - $start;
if ($output !== $expectation) printf("Expected \"%s\" in pair %d, but got \"%s\"\n", $expectation, $k, $output);
}
printf("Finished test %d\n", $testNr);
$testNr++;
function printMarkdownTable($results) {
$spacing = 0;
foreach ($results as $test) {
foreach ($test as $algo) {
$len = strlen($algo);
if ($len > $spacing) $spacing = $len;
}
}
$headerCount = count($results[0]);
for ($i = 1; $i <= $headerCount; $i++) {
printf("| %-" . $spacing . "s ", $i);
}
printf("|\n");
for ($i = 0; $i < $headerCount; $i++) {
printf("|--" . str_repeat('-', $spacing));
}
printf("|\n");
foreach ($results as $test) {
printf("| %s |\n", implode(' | ', array_map(
function($time) use ($spacing) { return sprintf("%-" . $spacing . "s", $time);
}, $test)));
}
}
$resultsPerIteration = array_map(function($test) use ($iterations) {
return array_map(function($algo) use ($iterations) {
return $algo === null ? null : $algo / $iterations;
}, $test);
}, $results);
$results_to_string = array_map(function($test) use ($decimals) {
return array_map(function($algo) use ($decimals) {
return $algo === null ? '' : (string) round($algo, $decimals);
}, $test);
}, $resultsPerIteration);
$results_to_percentages = array_map(function($test) {
$min = PHP_FLOAT_MAX;
foreach ($test as $k => $algo) {
if ($algo < $min) {
$min = $algo;
$minPos = $k;
}
}
$return = [];
foreach ($test as $k => $algo) {
if ($k === $minPos) {
$return[] = '100%';
} elseif ($min == 0) {
$return[] = 'n/a';
} else {
$return[] = round(($algo / $min) * 100) . '%';
}
}
return $return;
}, $results);
// printMarkdownTable($results_to_string);
// printf("\n");
printMarkdownTable($results_to_percentages);
@flip111
Copy link
Author

flip111 commented Mar 7, 2019

Results

The result tables have gaps when the algorithm could not handle empty input strings.

ASCII test input

Iterations: 814

1 2 3 4 5 6 7
341% 505% 388% 299% 122% 100% 172%
198% 278% 206% 109% 105% 100% 140%
199% 284% 206% 113% 258% 100% 146%
n/a n/a n/a n/a 100% n/a n/a
n/a n/a n/a n/a 100% n/a n/a
395359% 377746% 310028% 319261% 119% 100% 148%
358314% 338090% 330073% 287168% 102% 100% 123%
213539% 202909% 177953% 170014% 100% 114% 116%
12253% 10892% 9414% 9289% 100% 100% 107%
22558% 20014% 17179% 17315% 112% 100% 100%
19612% 17881% 15159% 15238% 106% 100% 100%
18042% 15806% 13296% 13235% 121% 104% 100%
32460% 27260% 22893% 23247% 102% 100% 100%
29092% 25553% 21530% 22245% 100% 108% 105%
6452% 5403% 4344% 4471% 103% 100% 101%
10584% 8541% 7369% 7315% 103% 100% 101%
9861% 8166% 7305% 6991% 100% 101% 100%
388% 316% 297% 211% 102% 100% 108%
425% 359% 345% 256% 101% 102% 100%
424% 351% 328% 260% 101% 100% 100%
441% 351% 311% 230% 100% 102% 103%
438% 371% 339% 253% 100% 100% 105%
412% 360% 335% 243% 100% 100% 100%
182% 154% 146% 108% 100% 101% 100%
192% 162% 158% 114% 102% 102% 100%
189% 158% 158% 113% 101% 100% 101%

620.55user 59.46system 11:21.10elapsed 99%CPU (0avgtext+0avgdata 65900maxresident)k

UTF-8 test input

Iterations: 61

1 2 3 4 5 6 7
308% 486% 351% 305% 100% 103% 170%
171% 271% 182% 100% 109% 100% 135%
176% 291% 203% 100% 127% 100% 139%
n/a n/a n/a n/a 100% n/a n/a
n/a n/a n/a n/a 100% n/a n/a
2363608% 2013753% 1354273% 1298543% 117% 100% 113%
624189% 555878% 408292% 385367% 100% 105% 111%
800051% 751260% 445116% 440085% 100% 110% 111%
9284% 8024% 5329% 4905% 101% 100% 102%
9654% 8228% 5927% 5211% 100% 104% 104%
7861% 6765% 3993% 3686% 100% 101% 100%
10881% 9430% 6494% 5717% 101% 101% 100%
11180% 9539% 6557% 6355% 101% 104% 100%
7028% 6716% 3891% 3712% 101% 100% 103%
4390% 4046% 2701% 2329% 100% 102% 103%
4433% 3931% 2694% 2313% 101% 101% 100%
3186% 2760% 1772% 1619% 102% 100% 100%
498% 440% 332% 245% 106% 102% 100%
496% 417% 315% 242% 122% 100% 100%
467% 406% 294% 228% 125% 100% 113%
512% 447% 320% 252% 104% 100% 108%
520% 452% 330% 261% 113% 100% 102%
463% 406% 293% 243% 109% 102% 100%
345% 309% 215% 147% 109% 102% 100%
346% 303% 217% 145% 102% 104% 100%
332% 296% 208% 136% 102% 102% 100%

568.25user 42.73system 10:12.70elapsed 99%CPU (0avgtext+0avgdata 426596maxresident)k

Conclusion

Algorithm 5 seems to be the best. Below is the same algorithm (number 7 in chart) in my preferred coding style:

function strlcs(string $first, string $second) : string {
  if ($first === '' || $second === '') {
    return '';
  }

  $len1 = strlen($first);
  $len2 = strlen($second);

  if ($len1 < $len2) {
    $shortest = $first;
    $longest = $second;
    $len_shortest = $len1;
  } else {
    $shortest = $second;
    $longest = $first;
    $len_shortest = $len2;
  }
  
  $pos = strpos($longest, $shortest);
  if ($pos !== false) {
    return $shortest;
  }

  for ($i = 1, $j = $len_shortest - 1; $j > 0; --$j, ++$i) {
    for ($k = 0; $k <= $i; ++$k) {
      $substr = substr($shortest, $k, $j);
      $pos = strpos($longest, $substr);
      if ($pos !== false) {
        return $substr;
      }
    }
  }

  return '';
}

Caveats

On this particular test input multibyte characters don't seem to give a problem with the tested algorithms. However that doesn't mean some multibyte characters can be constructed with overlap that would be problematic. You have been warned.

Possibly lcs2mb (algorithm 6) is safe, but i didn't test this.

Memory usage is not considered.

No nice benchmark framework is used such as phpbench

Regex

There is a big potential here to leverage the PCRE engine. Algorithmically PCRE should be slower, but since the logic is written in PHP as opposed to PCRE in C, PCRE has the potential to be faster any way.

No (working) algorithm does this so far but there are opportunities here.

Native speed

Needless to say this algorithm would be fasted when being compiled to assembly. PHP extensions can be made in C or Rust.

Links

Some interesting links.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment