A quick & dirty levenshtein for PHP

What is this?

This is a implementation of the levenshtein algorithm to calculate the "distance" between two strings. It can be compiled as a module / extension for PHP (version 4.3 - 5.2 tested on 64bit-architectures). So, why not take the existing PHP functions levenshtein() and similar_text()? Well, in my application, the function is called to track how many changes a user made to a text he wrote before. Usually, the texts are rather long, but only very few changes are made. So PHP's levenshtein() does not work due to the limitation to 255 characters per string. For similar_text(), I was not exactly satisfied with the performance. My implementation is based on PHP one's, but uses a heuristic to get better performance when only few changes are made: first, it checks how much of the beginning and the end of the two strings are equal. Then, it does the good old levenshtein on the middle part. It's that easy. :-)

It still has O(n²) time complexity, but the n is reduced. If changes are made at the very beginning and the very end of the string, it's exactly as slow as the original levenshtein - but in my application's case, this is not very often the case. In our case, it works around 5 times faster than the original version on average.

Download

xxlevenshtein.tar.bz2 [1675 bytes]

To install it, just do the usual PHP-extension-stuff: phpize, ./configure, make, make install and adding it to php.ini (extension = xxlevenshtein.so).

If you find any bugs or have improvements for it, please let me know.

Written by: Tobias Hößl