the Subject announced in the title of the article is not new. On the Internet you can find plenty of issues, how to implement it, but the answers somewhat less. And not infrequently, they are reduced to the advice to use the third-party products, for example, the Sphinx. But often in the use of such cumbersome add-ins not needed.

Not so long ago there was a following problem:

the

Is part of an information system (is) database (DB) actors, musicians, DJs, and sometimes politicians, scientists and other public entities.
is filled with Amateurs and dilettantes, many of which are difficult to properly translate into Russian language of a particular name. And sometimes, and so it happened that the name sounded Russian, not as written in a foreign language.
In the database name would be placed in a fairly free form. The order of the words in the name are often changed (especially in Asian names), additionally could be specified (or not specified) nicknames and aliases.

there were just clerical errors — substitutions, omissions or additions of letters. Sometimes part of the letters for some reason were written in Latin. the

Random characters, fortunately, was not.
the Names were stored in a MyISAM tables in MySQL with UTF-8 encoding.
Total number of names in the database was approximately 138.5 thousand.

This is the fact that "this". And "required" are the following:

Automatic viewing of the database to identify and create pairs of similar by some criterion of the names that pass for further manual analysis on the subject of regardless of whether they belong to the same person or different. Moreover, the search had to have some paranoidally. I.e. in contrast to the "Perhaps You meant ..." was to give a few more names on the principle of "Better safe than sorry". And, hence, to analyze for similarity had a larger number of pairs. Moreover, it is not in real time, but for really the final time. Ie, not in days and weeks, and in the worst case, for a few hours.

In General, if briefly, there are nearly one hundred and fifty thousand names, nicknames, aliases, without division into separate fields (surname, name, patronymic, etc.), with arbitrary word order, misspellings, errors, different spellings. And the names have a variety of nationality — Russian, English, Hebrew, Polish, French, Korean, Chinese, Japanese, Indian, Indian, etc. Required to find all possible pairs such that strongly resemble the same name.

At first the task seemed extremely difficult. I, unfamiliar with fuzzy search man, it was hard to imagine formal criteria to judge the similarity or difference in the names.

But gradually began to form an understanding in used for these purposes approaches and algorithms. First of all, it was the algorithms specifically targeted or just well-established for finding errors in spelling of names:

the

Phonetic encoding (in particular, "the Russian metaphone applied" Peter Bankovskogo and its modifications[1]).
the Measure of similarity based on edit distances the Levenshtein[2].
Selection of common forms[3].
the Algorithm Jaro-Winkler[4].
N-g analysis.[5].

common reflection made was the following:

Phonetic encoding is not suitable due to the specifics of filling the database. Phonetic search is able to find names with errors at the hearing, but not translated by Amateurs-Amateurs. Let me explain by example:

Here comes to our own passport office, captain of the third rank of Jeannette Devereaux[6] and appears on the Charter clearly and loudly. And the passport may, at the hearing to write her name and "Devaru" and "Devaru" and even "Devero". Yes, the doubling of consonants in a name miss. All of these "auditory hallucinations" perfectly tracked by the algorithms of fuzzy search with a phonetic encoding.
But would-be"translator", not knowing the specifics of the French language (but familiar with the English language), rewriting data from a military identification card, can easily write, "Gannett Deveraux". And gifted can specify the calling or getting the name in front of their name.

Or even a English-speaking "translators" don't know what the name "Ralph Fiennes" should read not "Ralph", "Ralph Fiennes".

So, the phonetic coding is not suitable because in this case incorrect names are not written "as hearingIC", and "as theRTCR".

The attempt to apply the phonetic algorithms and to perceive not by hearing, but the eye may cause phonetic indexes of names will differ from each other even more than the register.

So, maybe use the edit distance?

It would be a bad idea if it weren't for the fact that these metrics are largely aimed at the comparison of words, not phrases free word order. Should change the name of locations or to insert between them an alias and they show a very unsatisfactory result.

Necessary in this case to split the names into separate words and compare each of them already. And this greatly complicates (and slows) the algorithm, even if you bring these individual words to their phonetic index. The additional complexity adds the fact that the number of words in a name is variable. To account for possible omissions and permutations of words, it is hard to even imagine the complexity of this process.

Other methods

The disadvantages of the metrics selection of common forms can be attributed to the greater time complexity of computing relevance in the uncertain indexing capabilities, especially short names. Even a single allocation of the maximum common substring is of time complexity O(nm). As well as it should then continue recursive search for common substrings in the residuals, the calculation of a relevant function appears to be quite an operation, losing even N g comparison.

Again, if you change the words of places, the relevance decreases sharply.

The calculation of the distance Jaro-Winkler works much faster. But, unfortunately, the same is not resistant to change words sometimes.

So, it was not necessary to divide and combine words in the name, Willy-nilly, had to stop N g analysis. Rather, even trigrammes, because short Korean names do not have to allocate is too long N-grams.

the

the first Step. Frontal strike.

So, to start, to get an idea about the time complexity trigrammes comparison, I've written the simple script, in pairs, comparing names across the database.

Writing and testing script completed at home on the computer eight years ago, even with single-core Intel Pentium D 925 motherboard ASUS P5B-E with the usual jaded hard Seagate about the same age.

The script worked, that is, in the forehead. Reads the database, each name is pre-normalized, were smashed in the unique trigrams and counted their number of occurrences in each of the previous names (i.e., in the DB before the test). If the result of dividing twice the number of coincidences in the amount of unique trigrams in each of these names than some threshold (for example, of 0.75 or 0.8), it is recorded in the file.

Normalization name was:

the

English letters and some numbers were replaced with a similar spelling in Russian letters.
, All lowercase letters replaced by uppercase.
the Letter "E" replaced by "E" and "b" in "B".
All other letters, signs, characters and their sequences were replaced by two spaces.
Name abramses double spaces.

As a result of this normalization was simplified the name, without punctuation, without translitera of characters in which each word is framed or separated from other words by double spaces.
The use of double spaces between words and at the beginning and end of the name, did the algorithm trigrammes analysis is completely insensitive to change places of words.

In normalized name stood out all sorts of unique trigrams — combinations of a row of three located characters (including spaces).

Then for the two names is calculated following the relevance coefficient: twice the number of shared trigrams was divided by the sum of unique trigrams in each name.

The requirement of uniqueness of the trigrams voiced for a reason. On the one hand, the calculation of the total number of trigrams the names without taking into account their uniqueness will allow us to confidently distinguish between such names as "John Smith" and "John John Smith". And accounting only unique trigrams will lead to the fact that these names are considered equal.

On the other hand, first, the need for distinguishing between similar names will not arise often. And, secondly, the calculation of relevancy for all trigrammes and not only on the unique, significantly complicate the procedure. Because instead of twice the number of common trigrams will have to count the amount of occurrences of the trigrams from the first name in the second and trigrams from the second name in first. That is, roughly this will increase the time cost of computing relevance is about 2 times. Well and, thirdly, it will not allow you to apply to the evaluation of the relevance of some optimizations, which will be discussed further.

So, the frontal algorithm was written and tested on the first few thousand names. The result has been dismal. As expected, the time matching throughout the array of the names had a distinct quadratic dependence on its size. Trend forecast to the processing of the entire array of 130-140 thousand names was about two years. So much!

But there were initial performance evaluation of the algorithm.

the

and the second Stage. Search of ways of optimization.

First of all, I had to assess which elements of the algorithm need to be optimized, and what is notorious "a bottleneck" hampering the whole system. Optimization areas were identified a few:

Initial fetch from the database should not give the whole array of names, and only those who are already at least a little similar to the test, to reduce the number of other operations. Analysis of execution time of different code fragments showed that the DB queries are "bottle neck". It was necessary to minimize the execution time of each query and the number of issued names of applicants in each sample.
comparison Operations and copying the substrings held with strings in multibyte UTF-8 encoding, the performance is several times inferior to the same operations with strings in single-byte encoding.
does Not necessarily produce a precise calculation of the relevance coefficient, if the proportion of shared trigrams is small. In this case, there is a possibility to interrupt this process as soon as it will become clear. even if all the other trigrams will be shared, it will not lead to exceeding the desired boundary values.
Other small things.

the

Step three. Implementation.

First of all, as the most simple, an attempt has been made to transition from the normalized encoding names UTF-8 to Windows-1251. However, it turned out that MySQL is very jealous of storing in the database strings in different encoding, even different tables. That leads to a non-localizable error during operations with databases. Therefore, the normalized name is stored in the additional join table is still in UTF-8 format, and if necessary re-encoded in Windows-1251.

Storing the normalized name saves time, since the normalization should be performed on each entry only once.

In addition, in the same attached table stores the number of unique trigrams in the normalized name. This value can be used to force completion of the computation of relevance in the case if it becomes apparent that too great a distinction between the names. To calculate each time the number of unique trigrams in the name is no longer necessary.
And, most importantly, treated names indexed trigrammus index. Ie in the additional table lists all the unique trigrams present in the normalized name. In this case, trigrams themselves, not to store them in a "slow" UTF-8, stored as a integer hash — Junior three-byte numbers correspond to the numeric values of corresponding characters in the character set Windows-1251.

But it was made completely ineffective, as was found in the future, the approach for each of the trigrams of the name check was done its own sampling of the names of applicants, which are then combined into a single array. Each name in this array was accompanied by a counter — how many samples it was found. It was assumed that it will give some gain in computing relevance due to the fact that it is not necessary will then put these names on trigrams — it was enough to use this counter to calculate the relevance.

Unfortunately, repeated sampling from the MySQL database (mostly) and the formation of an integrated array of names (to a lesser extent) gave the following overhead, which was much higher than the winning resulting from the simplification of the calculation of relevance.

These optimizations failed to achieve a significant reduction in time costs:

Application trigrammes index for the initial selection reduced the forecast result time from two years to less than three months. To use it as a numeric value instead of a substring in UTF-8 made it possible to further reduce the time of processing by another 10-12%. But still, 2.5 months of continuous work seemed too much time. A thin place remained the initial sample of similar names.

the

Step four. Improvement.

Multiple access to the database for each trigrammes index was not the best idea. Moreover, numerous samples were run long enough, and the consolidated array is very large, and its processing has required considerable time despite the use of the General counter of trigrams for each item.

Had to abandon the idea of this counter, but to the entire initial set of names of applicants for a single query using the SQL WHERE clause IN. This has reduced the forecast several times up to one month.

Next was the modified calculation procedure of relevance. On the basis of the lower boundary relevance, and the number of unique trigrams the names of the calculated maximum number of "misses" is the number of trigrams in the name of the applicant, not included in the check name. Then all trigrams behalf of the applicant was tested on the entry to check the name. Once the number of mistakes exceeded the maximum, the comparison was stopped. It is possible to reduce the prediction time by a few percent. But before the industrial values was still very far away.

the

Step five. Decision.

The task seemed insurmountable. On the one hand was required to significantly reduce the number of names in the initial sample, this reduction is not supposed to be too artificial, not to accidentally cut off names that are close to the lower limit of the valid range of relevance.

Find the approach easy viewing of found pairs of similar names. Almost all of these pairs were common subsequences with a length of 6-7 characters (including enclosing double spaces). By simple arguments (which are not listed here due to their lack of rigor) it can be concluded that to achieve relevance than 0.75, the normalized names must have a common substring of 5 characters or longer.

Thus, by indexing a preliminary search trigrammes go to indexing pentagrams (5-character substrings). To fit the hash of the pentagram in an integer variable, with 32-bit PHP version size 32 bits, each symbol represented as 5-bits, good only 31 characters (without the "E" and "b") and a space.
the

Another small Supplement.

As mentioned above, when calculating the relevance of the unique trigrams, one of the names was compared for coincidence with trigrammic middle name. The calculation was stopped if the maximum number of "misses" the maximum quantity allowed.

If the number of unique trigrams in the first name much less than the number of unique trigrams in the second, the calculated maximum number of misses may be negative. This means that these names are well known not to be similar to a target value relevance. And check them no need.

But this cutoff is not running in the opposite case, if the number of unique trigrams in the first name much larger than the number of those in the second name. Therefore, in addition to the query sample was introduced the boundary values of the number of unique trigrams in the normalized name.

It is somewhat debatable improvement. When setting the boundary values of the relevancy of 0.75 does not cause the acceleration, and, on the contrary, leads to a performance loss of 3% due to the increase in complexity of the sample. But if you set the border relevance of 0.8, we get for 3% win.

However, this technique was abandoned. The loss is not so great, but at the beginning you can do a preliminary search for similar couples with a high degree of relevance. And only after pre-cleaning to set the threshold value equal to 0.75.

the

Result.

As a result of the transition from initial sample total trigrammes to selection pentagrams managed to significantly speed up the script. A similar search with a relevance of 0.8 pairs among 138.5 thousand names was 5 hours instead of one month. Ie, a single search for similar names in entire database for a given name is (if we take the quadratic dependence of processing time) about 0.26 s. Multiple lot, but we must remember that this was a test run not on the server with a powerful processor and high-performance disk system, and half-dead on the home computer of eight years ago.

In principle, it would be possible to carry out the initial search even hexagramme (index substring of 6 characters). This gave acceleration to a few more times. But there is a legitimate concern that can be screened out many pairs of Asian short names (and not only them). Yes and no special meaning, because subsequent manual analysis of the obtained pairs (and they had accumulated more than 11 thousand) is in any case a few months.

But indexing hexagramme quite suitable for tips of the form "did You mean ..." when excessive paranoidness is not required, and instead, find the minimum number of the most appropriate names.

In principle, when dealing with long names, you can use the index semisymbolic substrings. And to put them in 32-bit variables to pre-phonetic coding with the aim of reducing the number of characters to 15 and a space. But such a task at that moment was not.

Unfortunately, shortly after writing the script, the website was blocked on charges of copyright infringement. Although he soon moved to another address, due to the leaning problem was not to search for duplicate names.

the

Snippets algorithm:

Structure of the DB tables

CREATE TABLE IF NOT EXISTS `prs_persons` (
`id` int(10) unsigned NOT NULL,
`name_ru` varchar(255) collate utf8_unicode_ci is a default of NULL, //Translated into Russian language of a name
//Other fields
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci is a;

CREATE TABLE IF NOT EXISTS `prs_normal` (
`id` int(10) unsigned NOT NULL,
`name` varchar(255) collate utf8_unicode_ci is a default of NULL, //Normalized name
`num` int(2) unsigned NOT NULL, //Number of unique trigrams in the normalized name
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci is a;

CREATE TABLE IF NOT EXISTS `prs_5gramms` (
`code` int(10) unsigned NOT NULL, //Numeric code of the pentagram
`id` int(10) unsigned NOT NULL, //name Identifier to which there is
KEY `code` (`code`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci is a;


 
Constants and auxiliary variables, advanced actions: $opt_debug_show_sql = 0;
$opt_debug_mode = 0;
$opt_table_prefix = "prs";

define('MIN_RELEVANT', 0.8);
define('SITE_PREFIX', "..."); //Here is stored the prefix page of the person on the site.

//Maximum allowable ratio of the number of unique trigrams
//the below names could be similar.
$len_ratio = MIN_RELEVANT / (2 - MIN_RELEVANT);

$suitablesin = iconv("Windows-1251", "UTF-8",
"АаБбВвГгДдЕеЕеЖжЗзИиЙйКкЛлМмНнОоппррссттууффххццччшшщщьъыыъьээююяяaabbcceeghkkmmnoopprtuxxyy03468");
$suitablesout =
"ААББВВГГДДЕЕЕЕЖЖЗЗИИЙЙККЛЛММННООППРРССТТУУФФХХЦЦЧЧШШЩЩЬЬЫЫЬЬЭЭЮЮЯЯААВЬССЕЕДНККМТНООРРГТИХХУУОЗЧБВ";

$charcode = array( '' => 0, 'A' = > 1, 'B' = > 2, 'To' => 3, 'G' = > 4, 'D' => 5, 'E' = > 6, 'W' => 7,
'S' => 8, 'And' => 9, 'Th' => 10, 'To' => 11, 'L' = > 12, 'M' = > 13, 'N' = > 14, 'About' => 15,
'P' = > 16, 'R' = > 17, 'S' = > 18, 'T' = > 19, 'Y' = > 20, 'f' = > 21, 'X' = > 22, 'C' => 23,
'H' = > 24, 'W' = > 25, 'Y' = > 26, 'B' = > 27, 'S' = > 28, 't' = > 29, 'u' = > 30, 'I' => 31); //Pyatiletie character codes.

set_time_limit(0); //script execution Time limit.

function message_die($errno, $error, $file, $line)
{
if ($errno)
{
print "<p><b>Error " . $errno . "" . $file . "(" . $line . "):</b> " . $error;
die();
}
};

$fout = fopen("...", "w"); //Open the file for writing similar pairs.

//Here should be the preparation of a list of names for analysis and a basic cycle comparison.

fclose($fout);

 

 
the Body of the main loop of the comparison://For everyone else unparsed name:

$id = ...; $name_ru = ...; //Read the person ID and name in UTF-8.

$rusn = " "; //the Normalized name starts by two spaces.
$ischar = FALSE;

for ($j = 0; $j < mb_strlen($name_ru, "UTF-8"); $j++)
{
$char = mb_substr($name_ru, $j, 1, "UTF-8");

if (($pos = mb_strpos($suitablesin, $char, 0, "UTF-8")) === FALSE)
{
if ($ischar)
{
//All invalid characters are replaced by double spaces.
$rusn .= "";
$ischar = FALSE;
}
}
else
{
//Other perekodirovat in Windows-1251, taking into account possible transliteration.
$rusn .= $suitablesout{$pos};
$ischar = TRUE;
}
}

if ($ischar) $rusn .= ""; //Add two spaces at the end.

if (strlen($rusn) < 5) continue; // the Names of some of the spaces are not considered.

$norm = iconv("Windows-1251", "UTF-8", $rusn); //Perekodirovat to UTF-8 for writing to the database.

//Put in array of unique pentagram normalized name:

$subgramms = array();

//Normalized name starts with two blanks with code 0.
//Therefore, they do not encode it, and start the hash computation of the pentagram with the first letter.

$code = ($charcode[$rusn{2}] << 5) | $charcode[$rusn{3}];

for ($j = 4; $j < strlen($rusn); $j++)
{
$code = (($code << 5) | $charcode[$rusn{$j}]) &0x1FFFFFF;

$subgramms[$code] = $code; //write the hash of the pentagram in the array.
}

//Compile a list of unique trigrams in the normalized name and count their number.

$trigramms = array();

for ($k = 0; $k < strlen($rusn) - 2; $k++)
$trigramms[$trigramm = substr($rusn, $k, 3)] = $trigramm;

$n = count($trigramms);

//Calculate the boundary values of the number of unique trigrams in similar names:

$nmin = ceil($n * $len_ratio);
$nmax = floor($n / $len_ratio);

//Read from DB the list of the candidates for the similar names.

$similars = fquery("SELECT n.id AS id, n.name AS name, n.num AS num FROM ^@5gramms AS g ^@normal AS n WHERE g.code IN (^N) AND n.id = g.id AND 

n.num >= ^N AND n.num <= ^N", $subgramms, $nmin, $nmax);

//Written to DB normalized name and the number of unique trigrams.

fquery("INSERT INTO ^@normal (id, name, num) VALUES (^N, ^S, ^N)", $id, $norm, $n);

//Written to the database list of pentagrams that normalized name.

foreach ($subgramms as $key=>$code)
fquery("INSERT INTO ^@5gramms (code, id) VALUES (^N, ^N)", $code, $id);

unset($subgramms); //And release it.

//For each name candidate:

for ($i = 0; $i < @mysql_num_rows($similars); $i++)
{
$similar = @mysql_fetch_assoc($similars) OR
message_die(@mysql_errno(), @mysql_error(), __FILE__, __LINE__);

$name = iconv("UTF-8", "Windows-1251", $similar['name']);
$simid = $similar['id'];
$m = $similar['num']; //the Number of unique trigrams.
$nm = 0; //the Number of common trigrams.
$done = TRUE;
$miss = floor($m - MIN_RELEVANT * ($n + $m) / 2); //the Number of allowed mistakes.

for ($k = 0; ($k < strlen($name) - 2) & & ($miss > = 0); $k++)
{
if (strpos($name, $trigramm = substr($name, $k, 3), 0) == $k)
{
//If the trigram was found for the first time (unique):
if (isset($trigramms[$trigramm])) $nm++; //Increase the counter total.
else $miss--; //Otherwise decrement the remaining number of allowed misses.
}
}

if ($miss > = 0)
//If the number of failures is not exceeded, output to a file information about
//a similar pair of name and value relevance.
fwrite($fout, SITE_PREFIX . $id ."\t" . $rusn ."\t" . SITE_PREFIX . $key . "\t" . $name . "\t" . ($nm + $nm) / ($m + $n) . "\r\n");


@mysql_free_result($similars) OR message_die(@mysql_errno(), @mysql_error(), __FILE__, __LINE__);

unset($trigramms); //free the list of trigrams for reuse.

fclose($fout); //Close the file with the list of similar pairs.

 

 
Function formatted query to MySQL[7]
 
the // Syntax: fquery($query_template_text, $argument's_1_value, $argument's_2_value, ...)
// Special characters for a query template:
// ^@TableName - indicates that the combination ^@ is to be replaced with table prefix
// ^N - numeric parameter(s) (is not to be quoted), separated by comma if is array
// ^S - the string parameter(s) (is to be quoted), separated by comma if is array
// ^0 - "NULL" or "NOT NULL"

// (c) Andrey Grigoryev aka aka GrAnd Pochemuk
// Thanks to Artjom Kamnev (Kamnium), Mesilov Maxim (Severus) for idea
// http://life.screenshots.ru

// When successed query returns Recordset for SELECT or True for others.
// When error occurs returns False.

function fquery()
{
global $opt_debug_mode;
global $opt_debug_show_sql;
global $opt_table_prefix; // getting prefix from the site's options

if (is_array(func_get_arg(0))) $args = func_get_arg(0);
else $args = func_get_args();

$qtext = $args[0]; // the first argument is always query template text
if (empty($qtext)) return false; // Hmm, nothing to do!

$qtext = str_replace("^@", ($opt_table_prefix == "") ? "" : ($opt_table_prefix . '_'), $qtext); // replacing with table prefixes

$i = 0; $curArg = 1;

$query = "";

while ($i < strlen($qtext)) // strlen is always up-to-date, even if special chars are replaced
{
if ($qtext{$i} == '^')
{
if ($curArg >= count($args)) return false; // too many parameters in the query template!

$i++;

switch ($qtext{$i})
{
case 'N':
{
if (is_null($args[$curArg]))
{
$query .= "NULL";
continue;
}

if (is_array($args[$curArg])) $query .= implode(", ", $args[$curArg]);
else $query .= $args[$curArg];

break;
}
case 'S':
{
if (is_null($args[$curArg]))
{
$query .= "NULL";
continue;
}

if (is_array($args[$curArg])) $query .= "'" . implode("', '", $args[$curArg]) . "'";
else $query .= "'" . $args[$curArg] . "'";

break;
}
case '0':
{
if (is_null($args[$curArg])) return false; // incorrect parameter, nulls are not allowed!

$args[$curArg] = strtoupper($args[$curArg]);

if (($args[$curArg] != "NULL") && ($args[$curArg] != "NOT NULL")) return false; // incorrect parameter, a "NULL" or 

"NOT NULL" only!

$query .= $args[$curArg];

break;
}
default: $query .= $qtext{$i};
}

$i++; $curArg++;
}
else $query .= $qtext{$i++};
}

if ($opt_debug_show_sql == 1) print('<P><CODE>Query string: "' . $query . '"<CODE></P>' . "\r\n");

$ResultData = mysql_query($query);

if (mysql_errno() <> 0)
{
if ($opt_debug_mode == 1)
{
print('<P><CODE>error MySQL: #' . mysql_errno() . ': '. mysql_error() . ' 
');
print('Query string:' . $query . '</CODE></P>');
}
}

return($ResultData);
} // End of fquery

 

 
the test run Result at the first of thousands of records:

 
the operation Result 
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
the  
ADDRESS 1 
NAME 1 
ADDRESS 2 
NAME 2 
RELEVANCE 
 
/people/784 
 PETER JASON  
/people/389 
 PETER JACKSON  
0.8125 
 
/people/1216 
 CHARLES DANCE  
/people/664 
 CHARLES DENNIS  
0.8 
 
/people/1662 
 STUART f WILSON  
/people/1251 
 STUART WILSON  
0.914285714286 
 
/people/1798 
 MICHAEL MANN  
/people/583 
 MICHAEL LEHMANN  
0.846153846154 
 
/people/2062 
 MICHAEL BYRNE  
/people/265 
 MICHAEL biehn  
0.8 
 
/people/2557 
 Geena DAVIS  
/people/963 
 GENE DAVIS  
0.8 
 
/people/3093 
 J. J. JOHNSON  
/people/911 
 DON JOHNSON  
0.818181818182 
 
/people/3262 
 TOM EVERETT  
/people/586 
 TOM EVERETT SCOTT  
0.848484848485 
 
/people/3329 
 ROBERT RITTY  
/people/3099 
 ROBERT BEATTY  
0.827586206897 
 
/people/3585 
 TRACY KAY WOLFE  
/people/2810 
 TRACI WOLFE  
0.857142857143 
 
/people/3598 
 ALEXANDER KALUGIN  
/people/2852 
 ALEXANDER KALYAGIN  
0.85 
 
/people/3966 
 SERGEI BODROV  
/people/2991 
 SERGEI BODROV ML  
0.888888888889 
 
/people/3994 
 SERGEI BOBROV  
/people/3966 
 SERGEI BODROV  
0.8125 
 
/people/4049 
 RICHARD LEWIS  
/people/2063 
 RICHARD J. LEWIS  
0.882352941176 
 
/people/4293 
 JERRY LIVLI  
/people/2006 
 JERRY LEE  
0.88 
 
/people/4377 
 JOAN CUSACK  
/people/3774 
 JOHN CUSACK  
0.827586206897 
 
/people/4396 
 DEAN MCDERMOTT  
/people/2614 
 DYLAN MCDERMOTT  
0.833333333333 
 
/people/4608 
 SHAWN JOHNSTON  
/people/2036 
 J. J. JOHNSTON  
0.8 
 
/people/4981 
 CHRISTOPHER MEI  
/people/3233 
 CHRISTOPHER MURRAY  
0.8 
 
/people/5019 
 JANE ALEXANDER  
/people/381 
 JASON ALEXANDER  
0.842105263158 
 
/people/5551 
 CARLOS ANDRES GOMEZ  

 CARLOS GOMEZ  
0.810810810811 
 
/people/5781 
 ALEX NOYBERGER  
/people/4288 
 ALEX NEUBERGER  
0.8 
 
/people/5839 
 JOEY TRAVOLTA  
/people/935 
 JOHN TRAVOLTA  
0.8125 
 
/people/5917 
 JOE JOHNSTON  
/people/2036 
 J. J. JOHNSTON  
0.833333333333 
 
/people/5917 
 JOE JOHNSTON  
/people/4608 
 SHAWN JOHNSTON  
0.8 
 
/people/6112 
 THOMAS RYAN  
/people/4869 
 THOMAS J. RYAN  
0.823529411765 
 
/people/6416 
 BRIAN GEORGE  
/people/3942 
 GEORGE BRYANT  
0.848484848485 
 
/people/6520 
 JOHN CARNEY  
/people/5207 
 JOHN KANI  
0.8 
 
/people/6834 
 JOHN J. ANDERSON  
/people/5049 
 JOE ANDERSON  
0.838709677419 
 
/people/6836 
 MICHAEL ESTON  
/people/5056 
 MICHAEL WESTON  
0.827586206897 
 
/people/6837 
 DAVID BARON  
/people/5884 
 DAVID BARON  
0.827586206897 
 
/people/7261 
 BILLY GRAY  
/people/1695 
 BILLY RAY  
0.8 
 
/people/7361 
 ALAN DAVID  
/people/3087 
 DAVID ALAN BASH  
0.838709677419 
 
/people/7447 
 DAVID AYER  
/people/2277 
 THAYER DAVID  
0.814814814815 
 
/people/7497 
 ALEKSANDR KARAMNOV  
/people/3857 
 ALEXANDER KARPOV  
0.8 
 
/people/7499 
 NICHOLAS LUMLEY  
/people/4424 
 NICHOLAS  
0.827586206897 
 
/people/7534 
 RICHARD REHL  
/people/3547 
 RICHARD NIL  
0.857142857143 
 
/people/7547 
 SVETLANA STARIKOV  
/people/1985 
 SVETLANA SVETIKOVA  
0.8 
 
/people/7677 
 JACE ALEXANDER  
/people/381 
 JASON ALEXANDER  
0.842105263158 
 
/people/7677 
 JACE ALEXANDER  
/people/5019 
 JANE ALEXANDER  
0.833333333333 
 
/people/8000 
 GREGORY SMITH  
/people/7628 
 GREGORY P SMITH  
0.909090909091 
 
/people/8137 
 CASPER CHRISTENSEN  
/people/128 
 JESPER CHRISTENSEN  
0.8 
 
/people/8186 
 SHANE KOSUGI  
/people/6235 
 KANE KOSUGI  
0.814814814815 
 
/people/8219 
 BRANDON JAMES OLSON  
/people/797 
 JAMES OLSON  
0.810810810811 
 
/people/8442 
 GUNNAR LAFSSON  
/people/7033 
 GUNNAR JONSSON  
0.8 
 
/people/8458 
 JOHN ALEXANDER  
/people/381 
 JASON ALEXANDER  
0.810810810811 
 
/people/8458 
 JOHN ALEXANDER  
/people/5019 
 JANE ALEXANDER  
0.8 
 
/people/8614 
 DAVID HERMAN  
/people/4945 
 DAVID HAYMAN  
0.8 
 
/people/8874 
 NICOLAS ROEG  
/people/1667 
 NICHOLAS ROWE  
0.827586206897 
 
/people/8987 
 DAVID ROSS  
/people/4870 
 DAVID CROSS  
0.814814814815 
 
/people/9132 
 ROBERT LONG  
/people/7683 
 ROBERT LONGO  
0.827586206897 
 
/people/9202 
 ROBERT MENDEL  
/people/3410 
 ROBERT MANDEL  
0.8125 
 
/people/9229 
 ASHLEY LAWRENCE  
/people/2534 
 ASHLEY LAWRENCE  
1 
 
/people/9303 
 JOHN ELLARD  
/people/8703 
 JOHN AYLWARD  
0.827586206897 
 
/people/9308 
 SEAN ROBERTS  
/people/6552 
 SEAN O ROBERTS  
0.903225806452 
 
/people/9347 
 STEPHEN SERJIK  
/people/2911 
 STEPHEN SURZHIK  
0.8 
 
/people/9432 
 POLLY SHANNON  
/people/2240 
 MOLLY SHANNON  
0.8 
 
/people/9583 
 JULIE HARRIS  
/people/904 
 JULIUS HARRIS  
0.838709677419 
 
/people/9788 
 ANTHONY STARR  
/people/8308 
 ANTHONY STARK  
0.8 
 
/people/9835 
 MICHAEL WATKINS  
/people/4727 
 MICHAEL WATKINS  
0.864864864865 
 
/people/9893 
 STEVE MARTINO  
/people/6457 
 STEVE MARTIN  
0.827586206897 
 

 

 
the References and notes

 
1. Phonetic encoding.
 
→ Phonetic algorithms. Habrahabr
 
→ "what is your name", or Russian MetaPhone (Metaphone Russian description)
 

 
2. the Levenshtein Distance. Wikipedia
 

3. calculating the similarity of words based on the allocation of common forms implemented in PHP function similar_text. In the PHP documentation States that this function is the algorithm Oliver [1993]. However, originally this algorithm is called "the Ratcliff/Obershelp Pattern Recognition Algorithm" was posted by John W. Ratcliff on the website programmers Dr. Dobb's (Pattern Matching: the Gestalt Approach) in 1988. And Ian Oliver just used it in his book "Programming Classics: Implementing the World's Best Algorithms".
 

 
4. the Source code of the algorithm Jaro-Winkler can be viewed for example here: Jaro-Winkler Distance
 

 
5. Trigeminy index or "Finding mistakes". Habrahabr
 

 
6. Jeannette "angel" Devereaux is a character from the cult media franchise "Wing Commander", which includes computer cosmochemistry, strategy, other forms of games, books and film. Here are only due to the fact that I do not understand the French pronunciation of names, have long believed that it sounds like "Deveraux". Illustration of erroneous translation is not "by ear" and "eye".
 

 
7. Basis functions formatted SQL queries honestly borrowed from this time Artem and Maxim Mazilova from the article "SQL injection: the struggle for pleasure". Unfortunately, the site these guys recently downloading a copy of the article can still be found: Sql injection: the struggle in pleasure (unfortunately, without sources). Removed some controversial features. Instead, they added the ability to pass as argument array.

ADDRESS 1	NAME 1	ADDRESS 2	NAME 2	RELEVANCE
/people/784	PETER JASON	/people/389	PETER JACKSON	0.8125
/people/1216	CHARLES DANCE	/people/664	CHARLES DENNIS	0.8
/people/1662	STUART f WILSON	/people/1251	STUART WILSON	0.914285714286
/people/1798	MICHAEL MANN	/people/583	MICHAEL LEHMANN	0.846153846154
/people/2062	MICHAEL BYRNE	/people/265	MICHAEL biehn	0.8
/people/2557	Geena DAVIS	/people/963	GENE DAVIS	0.8
/people/3093	J. J. JOHNSON	/people/911	DON JOHNSON	0.818181818182
/people/3262	TOM EVERETT	/people/586	TOM EVERETT SCOTT	0.848484848485
/people/3329	ROBERT RITTY	/people/3099	ROBERT BEATTY	0.827586206897
/people/3585	TRACY KAY WOLFE	/people/2810	TRACI WOLFE	0.857142857143
/people/3598	ALEXANDER KALUGIN	/people/2852	ALEXANDER KALYAGIN	0.85
/people/3966	SERGEI BODROV	/people/2991	SERGEI BODROV ML	0.888888888889
/people/3994	SERGEI BOBROV	/people/3966	SERGEI BODROV	0.8125
/people/4049	RICHARD LEWIS	/people/2063	RICHARD J. LEWIS	0.882352941176
/people/4293	JERRY LIVLI	/people/2006	JERRY LEE	0.88
/people/4377	JOAN CUSACK	/people/3774	JOHN CUSACK	0.827586206897
/people/4396	DEAN MCDERMOTT	/people/2614	DYLAN MCDERMOTT	0.833333333333
/people/4608	SHAWN JOHNSTON	/people/2036	J. J. JOHNSTON	0.8
/people/4981	CHRISTOPHER MEI	/people/3233	CHRISTOPHER MURRAY	0.8
/people/5019	JANE ALEXANDER	/people/381	JASON ALEXANDER	0.842105263158
/people/5551	CARLOS ANDRES GOMEZ	CARLOS GOMEZ	0.810810810811
/people/5781	ALEX NOYBERGER	/people/4288	ALEX NEUBERGER	0.8
/people/5839	JOEY TRAVOLTA	/people/935	JOHN TRAVOLTA	0.8125
/people/5917	JOE JOHNSTON	/people/2036	J. J. JOHNSTON	0.833333333333
/people/5917	JOE JOHNSTON	/people/4608	SHAWN JOHNSTON	0.8
/people/6112	THOMAS RYAN	/people/4869	THOMAS J. RYAN	0.823529411765
/people/6416	BRIAN GEORGE	/people/3942	GEORGE BRYANT	0.848484848485
/people/6520	JOHN CARNEY	/people/5207	JOHN KANI	0.8
/people/6834	JOHN J. ANDERSON	/people/5049	JOE ANDERSON	0.838709677419
/people/6836	MICHAEL ESTON	/people/5056	MICHAEL WESTON	0.827586206897
/people/6837	DAVID BARON	/people/5884	DAVID BARON	0.827586206897
/people/7261	BILLY GRAY	/people/1695	BILLY RAY	0.8
/people/7361	ALAN DAVID	/people/3087	DAVID ALAN BASH	0.838709677419
/people/7447	DAVID AYER	/people/2277	THAYER DAVID	0.814814814815
/people/7497	ALEKSANDR KARAMNOV	/people/3857	ALEXANDER KARPOV	0.8
/people/7499	NICHOLAS LUMLEY	/people/4424	NICHOLAS	0.827586206897
/people/7534	RICHARD REHL	/people/3547	RICHARD NIL	0.857142857143
/people/7547	SVETLANA STARIKOV	/people/1985	SVETLANA SVETIKOVA	0.8
/people/7677	JACE ALEXANDER	/people/381	JASON ALEXANDER	0.842105263158
/people/7677	JACE ALEXANDER	/people/5019	JANE ALEXANDER	0.833333333333
/people/8000	GREGORY SMITH	/people/7628	GREGORY P SMITH	0.909090909091
/people/8137	CASPER CHRISTENSEN	/people/128	JESPER CHRISTENSEN	0.8
/people/8186	SHANE KOSUGI	/people/6235	KANE KOSUGI	0.814814814815
/people/8219	BRANDON JAMES OLSON	/people/797	JAMES OLSON	0.810810810811
/people/8442	GUNNAR LAFSSON	/people/7033	GUNNAR JONSSON	0.8
/people/8458	JOHN ALEXANDER	/people/381	JASON ALEXANDER	0.810810810811
/people/8458	JOHN ALEXANDER	/people/5019	JANE ALEXANDER	0.8
/people/8614	DAVID HERMAN	/people/4945	DAVID HAYMAN	0.8
/people/8874	NICOLAS ROEG	/people/1667	NICHOLAS ROWE	0.827586206897
/people/8987	DAVID ROSS	/people/4870	DAVID CROSS	0.814814814815
/people/9132	ROBERT LONG	/people/7683	ROBERT LONGO	0.827586206897
/people/9202	ROBERT MENDEL	/people/3410	ROBERT MANDEL	0.8125
/people/9229	ASHLEY LAWRENCE	/people/2534	ASHLEY LAWRENCE	1
/people/9303	JOHN ELLARD	/people/8703	JOHN AYLWARD	0.827586206897
/people/9308	SEAN ROBERTS	/people/6552	SEAN O ROBERTS	0.903225806452
/people/9347	STEPHEN SERJIK	/people/2911	STEPHEN SURZHIK	0.8
/people/9432	POLLY SHANNON	/people/2240	MOLLY SHANNON	0.8
/people/9583	JULIE HARRIS	/people/904	JULIUS HARRIS	0.838709677419
/people/9788	ANTHONY STARR	/people/8308	ANTHONY STARK	0.8
/people/9835	MICHAEL WATKINS	/people/4727	MICHAEL WATKINS	0.864864864865
/people/9893	STEVE MARTINO	/people/6457	STEVE MARTIN	0.827586206897


Article based on information from habrahabr.ru

Поиск по этому блогу

computer express

Finding similar names by means of MySQL+PHP

Not so long ago there was a following problem:

This is the fact that "this". And "required" are the following:

common reflection made was the following:

So, maybe use the edit distance?

Other methods

the first Step. Frontal strike.

and the second Stage. Search of ways of optimization.

Step three. Implementation.

Step four. Improvement.

Step five. Decision.

Another small Supplement.

Result.

Snippets algorithm:

test run Result at the first of thousands of records:

References and notes

Популярные сообщения из этого блога

Approval of WSUS updates: import, export, copy

The Hilbert curve vs. Z-order

Kaspersky Security Center — the fight for automation