Finding similar names by means of MySQL+PHP

the Subject announced in the title of the article is not new. On the Internet you can find plenty of issues, how to implement it, but the answers somewhat less. And not infrequently, they are reduced to the advice to use the third-party products, for example, the Sphinx. But often in the use of such cumbersome add-ins not needed.

Not so long ago there was a following problem:


the
    the
  • Is part of an information system (is) database (DB) actors, musicians, DJs, and sometimes politicians, scientists and other public entities.
  • the
  • is filled with Amateurs and dilettantes, many of which are difficult to properly translate into Russian language of a particular name. And sometimes, and so it happened that the name sounded Russian, not as written in a foreign language.
  • the
  • In the database name would be placed in a fairly free form. The order of the words in the name are often changed (especially in Asian names), additionally could be specified (or not specified) nicknames and aliases.
  • there were just clerical errors — substitutions, omissions or additions of letters. Sometimes part of the letters for some reason were written in Latin. the

  • Random characters, fortunately, was not.
  • the
  • the Names were stored in a MyISAM tables in MySQL with UTF-8 encoding.
  • the
  • Total number of names in the database was approximately 138.5 thousand.

This is the fact that "this". And "required" are the following:


Automatic viewing of the database to identify and create pairs of similar by some criterion of the names that pass for further manual analysis on the subject of regardless of whether they belong to the same person or different. Moreover, the search had to have some paranoidally. I.e. in contrast to the "Perhaps You meant ..." was to give a few more names on the principle of "Better safe than sorry". And, hence, to analyze for similarity had a larger number of pairs. Moreover, it is not in real time, but for really the final time. Ie, not in days and weeks, and in the worst case, for a few hours.

In General, if briefly, there are nearly one hundred and fifty thousand names, nicknames, aliases, without division into separate fields (surname, name, patronymic, etc.), with arbitrary word order, misspellings, errors, different spellings. And the names have a variety of nationality — Russian, English, Hebrew, Polish, French, Korean, Chinese, Japanese, Indian, Indian, etc. Required to find all possible pairs such that strongly resemble the same name.



At first the task seemed extremely difficult. I, unfamiliar with fuzzy search man, it was hard to imagine formal criteria to judge the similarity or difference in the names.

But gradually began to form an understanding in used for these purposes approaches and algorithms. First of all, it was the algorithms specifically targeted or just well-established for finding errors in spelling of names:

the
    the
  • Phonetic encoding (in particular, "the Russian metaphone applied" Peter Bankovskogo and its modifications[1]).
  • the
  • the Measure of similarity based on edit distances the Levenshtein[2].
  • the
  • Selection of common forms[3].
  • the
  • the Algorithm Jaro-Winkler[4].
  • the
  • N-g analysis.[5].

common reflection made was the following:


Phonetic encoding is not suitable due to the specifics of filling the database. Phonetic search is able to find names with errors at the hearing, but not translated by Amateurs-Amateurs. Let me explain by example:



Here comes to our own passport office, captain of the third rank of Jeannette Devereaux[6] and appears on the Charter clearly and loudly. And the passport may, at the hearing to write her name and "Devaru" and "Devaru" and even "Devero". Yes, the doubling of consonants in a name miss. All of these "auditory hallucinations" perfectly tracked by the algorithms of fuzzy search with a phonetic encoding.
But would-be"translator", not knowing the specifics of the French language (but familiar with the English language), rewriting data from a military identification card, can easily write, "Gannett Deveraux". And gifted can specify the calling or getting the name in front of their name.

Or even a English-speaking "translators" don't know what the name "Ralph Fiennes" should read not "Ralph", "Ralph Fiennes".

So, the phonetic coding is not suitable because in this case incorrect names are not written "as hearingIC", and "as theRTCR".

The attempt to apply the phonetic algorithms and to perceive not by hearing, but the eye may cause phonetic indexes of names will differ from each other even more than the register.

So, maybe use the edit distance?


It would be a bad idea if it weren't for the fact that these metrics are largely aimed at the comparison of words, not phrases free word order. Should change the name of locations or to insert between them an alias and they show a very unsatisfactory result.

Necessary in this case to split the names into separate words and compare each of them already. And this greatly complicates (and slows) the algorithm, even if you bring these individual words to their phonetic index. The additional complexity adds the fact that the number of words in a name is variable. To account for possible omissions and permutations of words, it is hard to even imagine the complexity of this process.

Other methods


The disadvantages of the metrics selection of common forms can be attributed to the greater time complexity of computing relevance in the uncertain indexing capabilities, especially short names. Even a single allocation of the maximum common substring is of time complexity O(nm). As well as it should then continue recursive search for common substrings in the residuals, the calculation of a relevant function appears to be quite an operation, losing even N g comparison.

Again, if you change the words of places, the relevance decreases sharply.

The calculation of the distance Jaro-Winkler works much faster. But, unfortunately, the same is not resistant to change words sometimes.

So, it was not necessary to divide and combine words in the name, Willy-nilly, had to stop N g analysis. Rather, even trigrammes, because short Korean names do not have to allocate is too long N-grams.

the

the first Step. Frontal strike.


So, to start, to get an idea about the time complexity trigrammes comparison, I've written the simple script, in pairs, comparing names across the database.

Writing and testing script completed at home on the computer eight years ago, even with single-core Intel Pentium D 925 motherboard ASUS P5B-E with the usual jaded hard Seagate about the same age.

The script worked, that is, in the forehead. Reads the database, each name is pre-normalized, were smashed in the unique trigrams and counted their number of occurrences in each of the previous names (i.e., in the DB before the test). If the result of dividing twice the number of coincidences in the amount of unique trigrams in each of these names than some threshold (for example, of 0.75 or 0.8), it is recorded in the file.

Normalization name was:

the
    the
  • English letters and some numbers were replaced with a similar spelling in Russian letters.
  • the
  • , All lowercase letters replaced by uppercase.
  • the
  • the Letter "E" replaced by "E" and "b" in "B".
  • the
  • All other letters, signs, characters and their sequences were replaced by two spaces.
  • the
  • Name abramses double spaces.

As a result of this normalization was simplified the name, without punctuation, without translitera of characters in which each word is framed or separated from other words by double spaces.
The use of double spaces between words and at the beginning and end of the name, did the algorithm trigrammes analysis is completely insensitive to change places of words.

In normalized name stood out all sorts of unique trigrams — combinations of a row of three located characters (including spaces).

Then for the two names is calculated following the relevance coefficient: twice the number of shared trigrams was divided by the sum of unique trigrams in each name.

The requirement of uniqueness of the trigrams voiced for a reason. On the one hand, the calculation of the total number of trigrams the names without taking into account their uniqueness will allow us to confidently distinguish between such names as "John Smith" and "John John Smith". And accounting only unique trigrams will lead to the fact that these names are considered equal.

On the other hand, first, the need for distinguishing between similar names will not arise often. And, secondly, the calculation of relevancy for all trigrammes and not only on the unique, significantly complicate the procedure. Because instead of twice the number of common trigrams will have to count the amount of occurrences of the trigrams from the first name in the second and trigrams from the second name in first. That is, roughly this will increase the time cost of computing relevance is about 2 times. Well and, thirdly, it will not allow you to apply to the evaluation of the relevance of some optimizations, which will be discussed further.

So, the frontal algorithm was written and tested on the first few thousand names. The result has been dismal. As expected, the time matching throughout the array of the names had a distinct quadratic dependence on its size. Trend forecast to the processing of the entire array of 130-140 thousand names was about two years. So much!

But there were initial performance evaluation of the algorithm.

the

and the second Stage. Search of ways of optimization.


First of all, I had to assess which elements of the algorithm need to be optimized, and what is notorious "a bottleneck" hampering the whole system. Optimization areas were identified a few:

    the
  1. Initial fetch from the database should not give the whole array of names, and only those who are already at least a little similar to the test, to reduce the number of other operations. Analysis of execution time of different code fragments showed that the DB queries are "bottle neck". It was necessary to minimize the execution time of each query and the number of issued names of applicants in each sample.
  2. the
  3. comparison Operations and copying the substrings held with strings in multibyte UTF-8 encoding, the performance is several times inferior to the same operations with strings in single-byte encoding.
  4. the
  5. does Not necessarily produce a precise calculation of the relevance coefficient, if the proportion of shared trigrams is small. In this case, there is a possibility to interrupt this process as soon as it will become clear. even if all the other trigrams will be shared, it will not lead to exceeding the desired boundary values.
  6. the
  7. Other small things.

the

Step three. Implementation.


First of all, as the most simple, an attempt has been made to transition from the normalized encoding names UTF-8 to Windows-1251. However, it turned out that MySQL is very jealous of storing in the database strings in different encoding, even different tables. That leads to a non-localizable error during operations with databases. Therefore, the normalized name is stored in the additional join table is still in UTF-8 format, and if necessary re-encoded in Windows-1251.

Storing the normalized name saves time, since the normalization should be performed on each entry only once.

In addition, in the same attached table stores the number of unique trigrams in the normalized name. This value can be used to force completion of the computation of relevance in the case if it becomes apparent that too great a distinction between the names. To calculate each time the number of unique trigrams in the name is no longer necessary.
And, most importantly, treated names indexed trigrammus index. Ie in the additional table lists all the unique trigrams present in the normalized name. In this case, trigrams themselves, not to store them in a "slow" UTF-8, stored as a integer hash — Junior three-byte numbers correspond to the numeric values of corresponding characters in the character set Windows-1251.

But it was made completely ineffective, as was found in the future, the approach for each of the trigrams of the name check was done its own sampling of the names of applicants, which are then combined into a single array. Each name in this array was accompanied by a counter — how many samples it was found. It was assumed that it will give some gain in computing relevance due to the fact that it is not necessary will then put these names on trigrams — it was enough to use this counter to calculate the relevance.

Unfortunately, repeated sampling from the MySQL database (mostly) and the formation of an integrated array of names (to a lesser extent) gave the following overhead, which was much higher than the winning resulting from the simplification of the calculation of relevance.

These optimizations failed to achieve a significant reduction in time costs:

Application trigrammes index for the initial selection reduced the forecast result time from two years to less than three months. To use it as a numeric value instead of a substring in UTF-8 made it possible to further reduce the time of processing by another 10-12%. But still, 2.5 months of continuous work seemed too much time. A thin place remained the initial sample of similar names.

the

Step four. Improvement.


Multiple access to the database for each trigrammes index was not the best idea. Moreover, numerous samples were run long enough, and the consolidated array is very large, and its processing has required considerable time despite the use of the General counter of trigrams for each item.

Had to abandon the idea of this counter, but to the entire initial set of names of applicants for a single query using the SQL WHERE clause IN. This has reduced the forecast several times up to one month.

Next was the modified calculation procedure of relevance. On the basis of the lower boundary relevance, and the number of unique trigrams the names of the calculated maximum number of "misses" is the number of trigrams in the name of the applicant, not included in the check name. Then all trigrams behalf of the applicant was tested on the entry to check the name. Once the number of mistakes exceeded the maximum, the comparison was stopped. It is possible to reduce the prediction time by a few percent. But before the industrial values was still very far away.

the

Step five. Decision.


The task seemed insurmountable. On the one hand was required to significantly reduce the number of names in the initial sample, this reduction is not supposed to be too artificial, not to accidentally cut off names that are close to the lower limit of the valid range of relevance.

Find the approach easy viewing of found pairs of similar names. Almost all of these pairs were common subsequences with a length of 6-7 characters (including enclosing double spaces). By simple arguments (which are not listed here due to their lack of rigor) it can be concluded that to achieve relevance than 0.75, the normalized names must have a common substring of 5 characters or longer.

Thus, by indexing a preliminary search trigrammes go to indexing pentagrams (5-character substrings). To fit the hash of the pentagram in an integer variable, with 32-bit PHP version size 32 bits, each symbol represented as 5-bits, good only 31 characters (without the "E" and "b") and a space.
the

Another small Supplement.


As mentioned above, when calculating the relevance of the unique trigrams, one of the names was compared for coincidence with trigrammic middle name. The calculation was stopped if the maximum number of "misses" the maximum quantity allowed.

If the number of unique trigrams in the first name much less than the number of unique trigrams in the second, the calculated maximum number of misses may be negative. This means that these names are well known not to be similar to a target value relevance. And check them no need.

But this cutoff is not running in the opposite case, if the number of unique trigrams in the first name much larger than the number of those in the second name. Therefore, in addition to the query sample was introduced the boundary values of the number of unique trigrams in the normalized name.

It is somewhat debatable improvement. When setting the boundary values of the relevancy of 0.75 does not cause the acceleration, and, on the contrary, leads to a performance loss of 3% due to the increase in complexity of the sample. But if you set the border relevance of 0.8, we get for 3% win.

However, this technique was abandoned. The loss is not so great, but at the beginning you can do a preliminary search for similar couples with a high degree of relevance. And only after pre-cleaning to set the threshold value equal to 0.75.

the

Result.


As a result of the transition from initial sample total trigrammes to selection pentagrams managed to significantly speed up the script. A similar search with a relevance of 0.8 pairs among 138.5 thousand names was 5 hours instead of one month. Ie, a single search for similar names in entire database for a given name is (if we take the quadratic dependence of processing time) about 0.26 s. Multiple lot, but we must remember that this was a test run not on the server with a powerful processor and high-performance disk system, and half-dead on the home computer of eight years ago.

In principle, it would be possible to carry out the initial search even hexagramme (index substring of 6 characters). This gave acceleration to a few more times. But there is a legitimate concern that can be screened out many pairs of Asian short names (and not only them). Yes and no special meaning, because subsequent manual analysis of the obtained pairs (and they had accumulated more than 11 thousand) is in any case a few months.

But indexing hexagramme quite suitable for tips of the form "did You mean ..." when excessive paranoidness is not required, and instead, find the minimum number of the most appropriate names.

In principle, when dealing with long names, you can use the index semisymbolic substrings. And to put them in 32-bit variables to pre-phonetic coding with the aim of reducing the number of characters to 15 and a space. But such a task at that moment was not.

Unfortunately, shortly after writing the script, the website was blocked on charges of copyright infringement. Although he soon moved to another address, due to the leaning problem was not to search for duplicate names.

the

Snippets algorithm:


Structure of the DB tables
CREATE TABLE IF NOT EXISTS `prs_persons` (
`id` int(10) unsigned NOT NULL,
`name_ru` varchar(255) collate utf8_unicode_ci is a default of NULL, //Translated into Russian language of a name
//Other fields
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci is a;

CREATE TABLE IF NOT EXISTS `prs_normal` (
`id` int(10) unsigned NOT NULL,
`name` varchar(255) collate utf8_unicode_ci is a default of NULL, //Normalized name
`num` int(2) unsigned NOT NULL, //Number of unique trigrams in the normalized name
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci is a;

CREATE TABLE IF NOT EXISTS `prs_5gramms` (
`code` int(10) unsigned NOT NULL, //Numeric code of the pentagram
`id` int(10) unsigned NOT NULL, //name Identifier to which there is
KEY `code` (`code`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci is a;


Constants and auxiliary variables, advanced actions:
 $opt_debug_show_sql = 0;
$opt_debug_mode = 0;
$opt_table_prefix = "prs";

define('MIN_RELEVANT', 0.8);
define('SITE_PREFIX', "..."); //Here is stored the prefix page of the person on the site.

//Maximum allowable ratio of the number of unique trigrams
//the below names could be similar.
$len_ratio = MIN_RELEVANT / (2 - MIN_RELEVANT);

$suitablesin = iconv("Windows-1251", "UTF-8",
"АаБбВвГгДдЕеЕеЖжЗзИиЙйКкЛлМмНнОоппррссттууффххццччшшщщьъыыъьээююяяaabbcceeghkkmmnoopprtuxxyy03468");
$suitablesout =
"ААББВВГГДДЕЕЕЕЖЖЗЗИИЙЙККЛЛММННООППРРССТТУУФФХХЦЦЧЧШШЩЩЬЬЫЫЬЬЭЭЮЮЯЯААВЬССЕЕДНККМТНООРРГТИХХУУОЗЧБВ";

$charcode = array( '' => 0, 'A' = > 1, 'B' = > 2, 'To' => 3, 'G' = > 4, 'D' => 5, 'E' = > 6, 'W' => 7,
'S' => 8, 'And' => 9, 'Th' => 10, 'To' => 11, 'L' = > 12, 'M' = > 13, 'N' = > 14, 'About' => 15,
'P' = > 16, 'R' = > 17, 'S' = > 18, 'T' = > 19, 'Y' = > 20, 'f' = > 21, 'X' = > 22, 'C' => 23,
'H' = > 24, 'W' = > 25, 'Y' = > 26, 'B' = > 27, 'S' = > 28, 't' = > 29, 'u' = > 30, 'I' => 31); //Pyatiletie character codes.

set_time_limit(0); //script execution Time limit.

function message_die($errno, $error, $file, $line)
{
if ($errno)
{
print "<p><b>Error " . $errno . "" . $file . "(" . $line . "):</b> " . $error;
die();
}
};

$fout = fopen("...", "w"); //Open the file for writing similar pairs.

//Here should be the preparation of a list of names for analysis and a basic cycle comparison.

fclose($fout);


the Body of the main loop of the comparison:
//For everyone else unparsed name:

$id = ...; $name_ru = ...; //Read the person ID and name in UTF-8.

$rusn = " "; //the Normalized name starts by two spaces.
$ischar = FALSE;

for ($j = 0; $j < mb_strlen($name_ru, "UTF-8"); $j++)
{
$char = mb_substr($name_ru, $j, 1, "UTF-8");

if (($pos = mb_strpos($suitablesin, $char, 0, "UTF-8")) === FALSE)
{
if ($ischar)
{
//All invalid characters are replaced by double spaces.
$rusn .= "";
$ischar = FALSE;
}
}
else
{
//Other perekodirovat in Windows-1251, taking into account possible transliteration.
$rusn .= $suitablesout{$pos};
$ischar = TRUE;
}
}

if ($ischar) $rusn .= ""; //Add two spaces at the end.

if (strlen($rusn) < 5) continue; // the Names of some of the spaces are not considered.

$norm = iconv("Windows-1251", "UTF-8", $rusn); //Perekodirovat to UTF-8 for writing to the database.

//Put in array of unique pentagram normalized name:

$subgramms = array();

//Normalized name starts with two blanks with code 0.
//Therefore, they do not encode it, and start the hash computation of the pentagram with the first letter.

$code = ($charcode[$rusn{2}] << 5) | $charcode[$rusn{3}];

for ($j = 4; $j < strlen($rusn); $j++)
{
$code = (($code << 5) | $charcode[$rusn{$j}]) &0x1FFFFFF;

$subgramms[$code] = $code; //write the hash of the pentagram in the array.
}

//Compile a list of unique trigrams in the normalized name and count their number.

$trigramms = array();

for ($k = 0; $k < strlen($rusn) - 2; $k++)
$trigramms[$trigramm = substr($rusn, $k, 3)] = $trigramm;

$n = count($trigramms);

//Calculate the boundary values of the number of unique trigrams in similar names:

$nmin = ceil($n * $len_ratio);
$nmax = floor($n / $len_ratio);

//Read from DB the list of the candidates for the similar names.

$similars = fquery("SELECT n.id AS id, n.name AS name, n.num AS num FROM ^@5gramms AS g ^@normal AS n WHERE g.code IN (^N) AND n.id = g.id AND 

n.num >= ^N AND n.num <= ^N", $subgramms, $nmin, $nmax);

//Written to DB normalized name and the number of unique trigrams.

fquery("INSERT INTO ^@normal (id, name, num) VALUES (^N, ^S, ^N)", $id, $norm, $n);

//Written to the database list of pentagrams that normalized name.

foreach ($subgramms as $key=>$code)
fquery("INSERT INTO ^@5gramms (code, id) VALUES (^N, ^N)", $code, $id);

unset($subgramms); //And release it.

//For each name candidate:

for ($i = 0; $i < @mysql_num_rows($similars); $i++)
{
$similar = @mysql_fetch_assoc($similars) OR
message_die(@mysql_errno(), @mysql_error(), __FILE__, __LINE__);

$name = iconv("UTF-8", "Windows-1251", $similar['name']);
$simid = $similar['id'];
$m = $similar['num']; //the Number of unique trigrams.
$nm = 0; //the Number of common trigrams.
$done = TRUE;
$miss = floor($m - MIN_RELEVANT * ($n + $m) / 2); //the Number of allowed mistakes.

for ($k = 0; ($k < strlen($name) - 2) & & ($miss > = 0); $k++)
{
if (strpos($name, $trigramm = substr($name, $k, 3), 0) == $k)
{
//If the trigram was found for the first time (unique):
if (isset($trigramms[$trigramm])) $nm++; //Increase the counter total.
else $miss--; //Otherwise decrement the remaining number of allowed misses.
}
}

if ($miss > = 0)
//If the number of failures is not exceeded, output to a file information about
//a similar pair of name and value relevance.
fwrite($fout, SITE_PREFIX . $id ."\t" . $rusn ."\t" . SITE_PREFIX . $key . "\t" . $name . "\t" . ($nm + $nm) / ($m + $n) . "\r\n");


@mysql_free_result($similars) OR message_die(@mysql_errno(), @mysql_error(), __FILE__, __LINE__);

unset($trigramms); //free the list of trigrams for reuse.

fclose($fout); //Close the file with the list of similar pairs.


Function formatted query to MySQL
[7]
the
// Syntax: fquery($query_template_text, $argument's_1_value, $argument's_2_value, ...)
// Special characters for a query template:
// ^@TableName - indicates that the combination ^@ is to be replaced with table prefix
// ^N - numeric parameter(s) (is not to be quoted), separated by comma if is array
// ^S - the string parameter(s) (is to be quoted), separated by comma if is array
// ^0 - "NULL" or "NOT NULL"

// (c) Andrey Grigoryev aka aka GrAnd Pochemuk
// Thanks to Artjom Kamnev (Kamnium), Mesilov Maxim (Severus) for idea
// http://life.screenshots.ru

// When successed query returns Recordset for SELECT or True for others.
// When error occurs returns False.

function fquery()
{
global $opt_debug_mode;
global $opt_debug_show_sql;
global $opt_table_prefix; // getting prefix from the site's options

if (is_array(func_get_arg(0))) $args = func_get_arg(0);
else $args = func_get_args();

$qtext = $args[0]; // the first argument is always query template text
if (empty($qtext)) return false; // Hmm, nothing to do!

$qtext = str_replace("^@", ($opt_table_prefix == "") ? "" : ($opt_table_prefix . '_'), $qtext); // replacing with table prefixes

$i = 0; $curArg = 1;

$query = "";

while ($i < strlen($qtext)) // strlen is always up-to-date, even if special chars are replaced
{
if ($qtext{$i} == '^')
{
if ($curArg >= count($args)) return false; // too many parameters in the query template!

$i++;

switch ($qtext{$i})
{
case 'N':
{
if (is_null($args[$curArg]))
{
$query .= "NULL";
continue;
}

if (is_array($args[$curArg])) $query .= implode(", ", $args[$curArg]);
else $query .= $args[$curArg];

break;
}
case 'S':
{
if (is_null($args[$curArg]))
{
$query .= "NULL";
continue;
}

if (is_array($args[$curArg])) $query .= "'" . implode("', '", $args[$curArg]) . "'";
else $query .= "'" . $args[$curArg] . "'";

break;
}
case '0':
{
if (is_null($args[$curArg])) return false; // incorrect parameter, nulls are not allowed!

$args[$curArg] = strtoupper($args[$curArg]);

if (($args[$curArg] != "NULL") && ($args[$curArg] != "NOT NULL")) return false; // incorrect parameter, a "NULL" or 

"NOT NULL" only!

$query .= $args[$curArg];

break;
}
default: $query .= $qtext{$i};
}

$i++; $curArg++;
}
else $query .= $qtext{$i++};
}

if ($opt_debug_show_sql == 1) print('<P><CODE>Query string: "' . $query . '"<CODE></P>' . "\r\n");

$ResultData = mysql_query($query);

if (mysql_errno() <> 0)
{
if ($opt_debug_mode == 1)
{
print('<P><CODE>error MySQL: #' . mysql_errno() . ': '. mysql_error() . ' 
');
print('Query string:' . $query . '</CODE></P>');
}
}

return($ResultData);
} // End of fquery


the

test run Result at the first of thousands of records:


the operation Result
the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the
ADDRESS 1 NAME 1 ADDRESS 2 NAME 2 RELEVANCE
/people/784 PETER JASON /people/389 PETER JACKSON 0.8125
/people/1216 CHARLES DANCE /people/664 CHARLES DENNIS 0.8
/people/1662 STUART f WILSON /people/1251 STUART WILSON 0.914285714286
/people/1798 MICHAEL MANN /people/583 MICHAEL LEHMANN 0.846153846154
/people/2062 MICHAEL BYRNE /people/265 MICHAEL biehn 0.8
/people/2557 Geena DAVIS /people/963 GENE DAVIS 0.8
/people/3093 J. J. JOHNSON /people/911 DON JOHNSON 0.818181818182
/people/3262 TOM EVERETT /people/586 TOM EVERETT SCOTT 0.848484848485
/people/3329 ROBERT RITTY /people/3099 ROBERT BEATTY 0.827586206897
/people/3585 TRACY KAY WOLFE /people/2810 TRACI WOLFE 0.857142857143
/people/3598 ALEXANDER KALUGIN /people/2852 ALEXANDER KALYAGIN 0.85
/people/3966 SERGEI BODROV /people/2991 SERGEI BODROV ML 0.888888888889
/people/3994 SERGEI BOBROV /people/3966 SERGEI BODROV 0.8125
/people/4049 RICHARD LEWIS /people/2063 RICHARD J. LEWIS 0.882352941176
/people/4293 JERRY LIVLI /people/2006 JERRY LEE 0.88
/people/4377 JOAN CUSACK /people/3774 JOHN CUSACK 0.827586206897
/people/4396 DEAN MCDERMOTT /people/2614 DYLAN MCDERMOTT 0.833333333333
/people/4608 SHAWN JOHNSTON /people/2036 J. J. JOHNSTON 0.8
/people/4981 CHRISTOPHER MEI /people/3233 CHRISTOPHER MURRAY 0.8
/people/5019 JANE ALEXANDER /people/381 JASON ALEXANDER 0.842105263158
/people/5551 CARLOS ANDRES GOMEZ CARLOS GOMEZ 0.810810810811
/people/5781 ALEX NOYBERGER /people/4288 ALEX NEUBERGER 0.8
/people/5839 JOEY TRAVOLTA /people/935 JOHN TRAVOLTA 0.8125
/people/5917 JOE JOHNSTON /people/2036 J. J. JOHNSTON 0.833333333333
/people/5917 JOE JOHNSTON /people/4608 SHAWN JOHNSTON 0.8
/people/6112 THOMAS RYAN /people/4869 THOMAS J. RYAN 0.823529411765
/people/6416 BRIAN GEORGE /people/3942 GEORGE BRYANT 0.848484848485
/people/6520 JOHN CARNEY /people/5207 JOHN KANI 0.8
/people/6834 JOHN J. ANDERSON /people/5049 JOE ANDERSON 0.838709677419
/people/6836 MICHAEL ESTON /people/5056 MICHAEL WESTON 0.827586206897
/people/6837 DAVID BARON /people/5884 DAVID BARON 0.827586206897
/people/7261 BILLY GRAY /people/1695 BILLY RAY 0.8
/people/7361 ALAN DAVID /people/3087 DAVID ALAN BASH 0.838709677419
/people/7447 DAVID AYER /people/2277 THAYER DAVID 0.814814814815
/people/7497 ALEKSANDR KARAMNOV /people/3857 ALEXANDER KARPOV 0.8
/people/7499 NICHOLAS LUMLEY /people/4424 NICHOLAS 0.827586206897
/people/7534 RICHARD REHL /people/3547 RICHARD NIL 0.857142857143
/people/7547 SVETLANA STARIKOV /people/1985 SVETLANA SVETIKOVA 0.8
/people/7677 JACE ALEXANDER /people/381 JASON ALEXANDER 0.842105263158
/people/7677 JACE ALEXANDER /people/5019 JANE ALEXANDER 0.833333333333
/people/8000 GREGORY SMITH /people/7628 GREGORY P SMITH 0.909090909091
/people/8137 CASPER CHRISTENSEN /people/128 JESPER CHRISTENSEN 0.8
/people/8186 SHANE KOSUGI /people/6235 KANE KOSUGI 0.814814814815
/people/8219 BRANDON JAMES OLSON /people/797 JAMES OLSON 0.810810810811
/people/8442 GUNNAR LAFSSON /people/7033 GUNNAR JONSSON 0.8
/people/8458 JOHN ALEXANDER /people/381 JASON ALEXANDER 0.810810810811
/people/8458 JOHN ALEXANDER /people/5019 JANE ALEXANDER 0.8
/people/8614 DAVID HERMAN /people/4945 DAVID HAYMAN 0.8
/people/8874 NICOLAS ROEG /people/1667 NICHOLAS ROWE 0.827586206897
/people/8987 DAVID ROSS /people/4870 DAVID CROSS 0.814814814815
/people/9132 ROBERT LONG /people/7683 ROBERT LONGO 0.827586206897
/people/9202 ROBERT MENDEL /people/3410 ROBERT MANDEL 0.8125
/people/9229 ASHLEY LAWRENCE /people/2534 ASHLEY LAWRENCE 1
/people/9303 JOHN ELLARD /people/8703 JOHN AYLWARD 0.827586206897
/people/9308 SEAN ROBERTS /people/6552 SEAN O ROBERTS 0.903225806452
/people/9347 STEPHEN SERJIK /people/2911 STEPHEN SURZHIK 0.8
/people/9432 POLLY SHANNON /people/2240 MOLLY SHANNON 0.8
/people/9583 JULIE HARRIS /people/904 JULIUS HARRIS 0.838709677419
/people/9788 ANTHONY STARR /people/8308 ANTHONY STARK 0.8
/people/9835 MICHAEL WATKINS /people/4727 MICHAEL WATKINS 0.864864864865
/people/9893 STEVE MARTINO /people/6457 STEVE MARTIN 0.827586206897


the

References and notes


1. Phonetic encoding.
Phonetic algorithms. Habrahabr
"what is your name", or Russian MetaPhone (Metaphone Russian description)

2. the Levenshtein Distance. Wikipedia
3. calculating the similarity of words based on the allocation of common forms implemented in PHP function similar_text. In the PHP documentation States that this function is the algorithm Oliver [1993]. However, originally this algorithm is called "the Ratcliff/Obershelp Pattern Recognition Algorithm" was posted by John W. Ratcliff on the website programmers Dr. Dobb's (Pattern Matching: the Gestalt Approach) in 1988. And Ian Oliver just used it in his book "Programming Classics: Implementing the World's Best Algorithms".

4. the Source code of the algorithm Jaro-Winkler can be viewed for example here: Jaro-Winkler Distance

5. Trigeminy index or "Finding mistakes". Habrahabr

6. Jeannette "angel" Devereaux is a character from the cult media franchise "Wing Commander", which includes computer cosmochemistry, strategy, other forms of games, books and film. Here are only due to the fact that I do not understand the French pronunciation of names, have long believed that it sounds like "Deveraux". Illustration of erroneous translation is not "by ear" and "eye".

7. Basis functions formatted SQL queries honestly borrowed from this time Artem and Maxim Mazilova from the article "SQL injection: the struggle for pleasure". Unfortunately, the site these guys recently downloading a copy of the article can still be found: Sql injection: the struggle in pleasure (unfortunately, without sources). Removed some controversial features. Instead, they added the ability to pass as argument array.
Article based on information from habrahabr.ru

Популярные сообщения из этого блога

Approval of WSUS updates: import, export, copy

Kaspersky Security Center — the fight for automation

The Hilbert curve vs. Z-order