Henrik Norbeck, Stockholm 2017
Here I present an algorithm for fast approximate text searches in a database using an index table of bigrams. A search using this algorithm takes only about 50% longer than an exact search, and returns the most relevant results. All the information needed to implement this algorithm is presented here, including code. I also discuss how this algorithm could be adapted for finding melody fragments in a music database.
On the search page of my ABC tune collection I have recently implemented approximate searches (often called "fuzzy search" and sometimes "loose match" or "near match"). Here I will describe the algorithm used, first in terms that can be understood by "anybody", i.e. those who have no special knowledge of programming or databases, and further down also in programming and database terms.
Approximate text matching means that you can find relevant texts, even if you don't know exactly how a word or name is spelled, e.g. when searching for "humors" (American spelling), you will also find "humours" (European spelling).
There are several algorithms for approximate text matching, and many of them have their merits and practical uses. They are used for instance in search engines such as Google, but also for matching of DNA sequences or proteins. You can read more about approximate text matching on Wikipedia.
In my program AbcMus, I implemented an approximate search back in 2002, using the Bitap algorithm (also called Baeza-Yates-Gonnet algorithm). This search algorithm, however, has to go through all the text (e.g. all the tune titles) to look for "near matches". This means that when the material to search is large (e.g. thousands of tunes) it will take a long time.
In 2007, when I worked for The Swedish National Collections of Music making the web presentation of the scanned pages from the music collection of Folkmusikkommissionen and music books of the Stockholm Music Museum, I wanted to implement an approximate search of the metadata, due to the large amount of names, where spellings might differ slightly. Here I had a database of about 30000 pages, each containing lots of text, so a "brute force" search using the Bitap algorithm would be far too slow and consume too much memory for each search request. I needed an algorithm that made a smaller selection from my huge text database.
I started looking for other approximate search algorithms and found the article "How to Strike a Match" by Simon White at Catalysoft. I found that an algorithm based on this idea would do the job, and implemented it. The idea works as follows.
A bigram (also called digram or 2-gram) is a sequence of two letters. (More generally n-grams, sequences of n letters, have many uses in lexicography.) From your text database you extract all the words and the bigrams for each word. E.g. the word "sealed" would produce the bigrams "se", "ea", "al", "le" and "ed". The words and corresponding bigrams are stored in database tables. A short example could look like this.
Word | Bigrams |
---|---|
sealed | se, ea, al, le, ed |
healthy | he, ea, al, lt, th, hy |
heard | he, ea, ar, rd |
herded | he, er, rd, de, ed |
help | he, el, lp |
sold | so, ol, ld |
If you search this for e.g. the word "healed", you first split your search word into bigrams, in this case "he", "ea", "al", "le", "ed". Then you calculate how many of these bigrams match the bigrams for each word in the database, as a percentage of the number of bigrams in the search string. For example, "sealed" matches four of the five bigrams ("ea", "al", "le", "ed"), which gives a score of 4/5 = 80%.
Word | Bigrams | Score for "healed" |
---|---|---|
sealed | se, ea, al, le, ed | 80% |
healthy | he, ea, al, lt, th, hy | 60% |
heard | he, ea, ar, rd | 40% |
herded | he, er, rd, de, ed | 40% |
help | he, el, lp | 20% |
sold | so, ol, ld | 0% |
A wise choice would be to only present matches where the score is larger than some limit, e.g. 50%.
This idea is then expanded so that you search for every word in the user's input and add up the scores, e.g. a search for "healed herd" would match "sealed herded" (90%) and "sealed heard" (65%).
This algorithm is a version of what is called a Sørensen-Dice index (or Sørensen-Dice coefficient or F1 score or Dice similarity coefficient (DSC)), which was independently developed by the botanists Thorvald Sørensen and Lee Raymond Dice, in 1948 and 1945 respectively, as a statistic used for comparing the similarity of two samples.
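For two words with bigram sets X and Y, the coefficient itself is defined as

DSC = 2 |X ∩ Y| / (|X| + |Y|)

e.g. for "healed" and "sealed" above, DSC = 2 × 4 / (5 + 5) = 80%, which coincides with the score in the table since both words have five bigrams.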
Now we get down to the technical details of how to implement this in a database. Here I present how I have implemented this in my title search. If you want to implement something similar in your database, you can just slightly modify my code below. I use MySQL, but the example should work with other SQL dialects with only small modifications.
Since I first implemented an approximate search using this algorithm in 2007, I have improved it significantly, and it is this improved version I present here.
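As a warm-up, here is a minimal SQL sketch of the single-word scoring from the example above, with the bigrams simply stored as (word, bigram) rows. The table wordbigram is just a toy stand-in for illustration; the real tables are described below.

-- toy table: one row per (word, bigram); only two of the example words shown
CREATE TABLE wordbigram (word VARCHAR(30), bigram CHAR(2));
INSERT INTO wordbigram VALUES
  ('sealed','se'),('sealed','ea'),('sealed','al'),('sealed','le'),('sealed','ed'),
  ('heard','he'),('heard','ea'),('heard','ar'),('heard','rd');

-- score each word against the 5 bigrams of the search word "healed"
SELECT wb.word, COUNT(*) * 100 / 5 AS score
FROM (SELECT 'he' AS bigram UNION SELECT 'ea' UNION SELECT 'al'
      UNION SELECT 'le' UNION SELECT 'ed') AS srch
INNER JOIN wordbigram AS wb ON wb.bigram = srch.bigram
GROUP BY wb.word
ORDER BY score DESC;
-- gives sealed: 80, heard: 40, as in the table above

Words that share no bigrams with the search word, like "sold", simply don't show up in the result. The real search below works the same way, but references titles and applies cutoffs.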
For all the records in the table you want to search (e.g. title for tune titles), take the text and split it into words, which are stored in a table titlewordlist, with references to the table title. Here a tune can have more than one title.
CREATE TABLE title (
titleid INT NOT NULL AUTO_INCREMENT,
tuneid INT,
title VARCHAR(70)
);
CREATE TABLE titlewordlist (
word VARCHAR(30),
titleid INT
);
Before storing the words, I also make them lowercase and "remove all accents", e.g. ò ó ô ö ø are all converted to "o". Many users do not know how to type "letters with accents" on their keyboard, and often wouldn't know where to place them even if they could type them. I also remove any apostrophes (because many people aren't sure where to use them), so e.g. "don't" will become "dont". I also skip all words shorter than 4 characters.
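I do this processing in application code (see appendix 1), but as an illustration, the transformations could be sketched in SQL like this (only a couple of the accent replacements shown):

-- illustration only: lowercase, remove apostrophes, map a few accented letters
SELECT REPLACE(REPLACE(REPLACE(LOWER(title), '''', ''), 'ö', 'o'), 'ò', 'o') AS normalized
FROM title;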
For all words in titlewordlist, extract the bigrams to a table titlebigram, with references to titlewordlist.
CREATE TABLE titlebigram (
bigram CHAR(2),
word VARCHAR(30)
);
Of course you have to create appropriate indexes on these tables, in order to make your search fast.
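What "appropriate" means depends on your queries; as a sketch, something like this would cover the joins used below (the improved table structure further down uses primary keys instead):

-- possible indexes for the tables above (a sketch, not tuned)
CREATE INDEX ix_titlebigram_bigram ON titlebigram (bigram, word);
CREATE INDEX ix_titlewordlist_word ON titlewordlist (word, titleid);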
The tables titlewordlist and titlebigram of course also have to be updated whenever you delete, insert or update in the table title. This could be implemented in a database trigger (though I must confess I hate database triggers on principle - they often create confusion when searching for problems in the program).
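A minimal sketch of the trigger approach could look like this (hypothetical, and only handling deletes; INSERT and UPDATE would need corresponding triggers, and orphaned rows in titlebigram would still need cleaning up):

-- hypothetical trigger: remove index rows when a title is deleted
CREATE TRIGGER title_after_delete AFTER DELETE ON title
FOR EACH ROW
  DELETE FROM titlewordlist WHERE titleid = OLD.titleid;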
An example for the tune title "Humours of Ballyloughlin, The".
Word | Bigrams |
---|---|
humours | hu, um, mo, ou, ur, rs |
ballyloughlin | ba, al, ll, ly, yl, lo, ou, ug, gh, hl, li, in |
Now you have your tables ready for searching.
When processing a search request from the user, first do the same processing of the search string as when creating the search tables. I.e. split into words, make lowercase, remove accents, remove apostrophes, skip words shorter than 4 characters. This is (as far as I can see) most efficient to perform in your code, not in the database.
Then you create and fill a temporary table for the bigrams of the search string words, and execute a complicated SELECT statement (see below).
The following shows the SQL code for the search "Humors of Ballylochlin" (we assume the user is an American, who has heard the tune name, but does not know exactly how it is spelled).
CREATE TEMPORARY TABLE srchbigram(w int, len int, bigram char(2));
-- w = search-word number
-- len = number of bigrams in the search-word (its length minus 1)
INSERT INTO srchbigram(w, len, bigram) VALUES(1, 5, 'hu');
INSERT INTO srchbigram(w, len, bigram) VALUES(1, 5, 'um');
INSERT INTO srchbigram(w, len, bigram) VALUES(1, 5, 'mo');
INSERT INTO srchbigram(w, len, bigram) VALUES(1, 5, 'or');
INSERT INTO srchbigram(w, len, bigram) VALUES(1, 5, 'rs');
INSERT INTO srchbigram(w, len, bigram) VALUES(2, 11, 'ba');
INSERT INTO srchbigram(w, len, bigram) VALUES(2, 11, 'al');
INSERT INTO srchbigram(w, len, bigram) VALUES(2, 11, 'll');
INSERT INTO srchbigram(w, len, bigram) VALUES(2, 11, 'ly');
INSERT INTO srchbigram(w, len, bigram) VALUES(2, 11, 'yl');
INSERT INTO srchbigram(w, len, bigram) VALUES(2, 11, 'lo');
INSERT INTO srchbigram(w, len, bigram) VALUES(2, 11, 'oc');
INSERT INTO srchbigram(w, len, bigram) VALUES(2, 11, 'ch');
INSERT INTO srchbigram(w, len, bigram) VALUES(2, 11, 'hl');
INSERT INTO srchbigram(w, len, bigram) VALUES(2, 11, 'li');
INSERT INTO srchbigram(w, len, bigram) VALUES(2, 11, 'in');
SELECT title.title, ROUND((asrch.score * 100) / asrch.totlen) AS percent, tune.tuneid
FROM
(SELECT tw.titleid, SUM(tscore.c) AS score, 16 AS totlen
FROM
(SELECT srchbigram.w, srchbigram.len, titlebigram.word, COUNT(titlebigram.word) AS c
FROM srchbigram
INNER JOIN titlebigram ON titlebigram.bigram = srchbigram.bigram
GROUP BY srchbigram.w, srchbigram.len, titlebigram.word
HAVING c > srchbigram.len / 2) AS tscore
INNER JOIN titlewordlist AS tw ON tw.word = tscore.word
GROUP BY tw.titleid
HAVING score > totlen / 2) AS asrch
INNER JOIN title ON title.titleid = asrch.titleid
INNER JOIN tune ON tune.tuneid = title.tuneid
ORDER BY percent DESC LIMIT 50;
DROP TEMPORARY TABLE srchbigram;
The innermost SELECT finds all words that match according to their bigram score. There is a cutoff, i.e. the match for each word must be larger than 50% (c > srchbigram.len / 2). If you want to cut off results at some other percentage, change to e.g. "c > srchbigram.len * 60 / 100" for 60%.
The middle SELECT finds all titles matching the words found in the innermost SELECT. Here there is also a cutoff at 50% (score > totlen / 2), which you could also change to e.g. 60%.
The outermost SELECT simply gets the relevant data and sorts it by descending match score. I have set a limit of the 50 best matching titles, which of course can be changed. In a real application, you probably also want more fields from tune, e.g. tune.abc, tune.key, tune.incipit.
"16 AS totlen" is included from the code when building the SQL, and is the sum of the number of bigrams for the words.
The above SQL SELECT is reasonably fast. Compared to an exact search, e.g.
SELECT title.title, tune.tuneid FROM title
INNER JOIN tune ON tune.tuneid = title.tuneid
WHERE title.title LIKE '%humours%ballyloughlin%'
it only takes about 4 times longer. The whole SQL (including creating and dropping the temporary table and the inserts into it) took 0.0138 seconds, compared to 0.0033 seconds for the exact search (which has to get all the rows in title). If ENGINE=MEMORY is used for the temporary table, it saves some time, and the approximate search will only take about 3 times longer than an exact search. Further on I will introduce various optimisations to make the SQL faster, but this first version makes it easier to understand how it works.
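With the MEMORY engine, the CREATE statement above becomes:

CREATE TEMPORARY TABLE srchbigram(w INT, len INT, bigram CHAR(2)) ENGINE=MEMORY;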
Below you can see what the search result could look like when presented to the user.
Match | Title |
---|---|
81% | Humours of Ballyloughlin, The |
The indexing tables titlewordlist and titlebigram become bigger than the original table title, but not too terribly big. Below you can see a snapshot of table sizes for my database.
Table | Number of Rows | Data Length | Index Length |
---|---|---|---|
title | 3817 | 256 kB | 272 kB |
titlewordlist | 7827 | 480 kB | |
titlebigram | 19002 | 1552 kB | |
The index on the table title is on the field title.title. The tables titlewordlist and titlebigram have a total size of 2492 kB, i.e. roughly 8 times larger than the "standard" index on title.title, and also roughly 8 times larger than the table title. For the lyrics search the sizes are as below.
Table | Number of Rows | Data Length |
---|---|---|
lyrics | 189 | 448 kB |
lyricswordlist | 22170 | 1552 kB |
lyricsbigram | 22202 | 1552 kB |
The tables lyricswordlist and lyricsbigram have a total size roughly 7 times larger than the table lyrics. This is of course because lyrics contain many repetitions of the same words.
So the price you have to pay in extra disk space for implementing this type of approximate search is about 7-8 times the size of your texts, depending on the type of text. This still seems reasonable, considering that you get an efficient and simple search - and disk space today doesn't cost all that much.
The SELECT statement above can, as I said, be optimised. As it stands, it takes about 300% of the time of an exact search.
The first optimisation is that we can improve the table structure. The connection between the tables titlebigram and titlewordlist is the field word, which is a VARCHAR(30). It's more efficient to replace this with an integer. The search doesn't actually need to know the word itself (it only needs the bigrams and a unique reference to the word); the word is only used when building the index tables. So we stuff it away into a separate table.
CREATE TABLE title ( -- base table for titles
titleid INT NOT NULL AUTO_INCREMENT,
tuneid INT,
title VARCHAR(70),
PRIMARY KEY(titleid)
);
CREATE TABLE titlewordlist ( -- for storing the words - only used when building index tables
twlid INT NOT NULL AUTO_INCREMENT,
word VARCHAR(30),
PRIMARY KEY(twlid)
);
CREATE TABLE titleword ( -- connects bigrams to titles
twlid INT,
titleid INT,
PRIMARY KEY(twlid, titleid)
);
CREATE TABLE titlebigram ( -- bigrams for the words
bigram CHAR(2),
twlid INT,
PRIMARY KEY(bigram, twlid)
);
You can find code for filling these tables with their content in appendix 1.
This will give fewer rows in the tables, but on the other hand we have an extra table, so the total size of the indexing tables is slightly bigger (though not very much). Still about 8 times bigger than the title table, and about 4 times bigger than the total size for the title table and its index.
Table | Number of Rows | Data Length | Index Length |
---|---|---|---|
title | 3817 | 256 kB | 272 kB |
titleword | 7827 | 400 kB | |
titlewordlist | 3199 | 144 kB | |
titlebigram | 15873 | 1552 kB | |
In the SQL above, the search data is fed into the temporary table srchbigram through a number of INSERT statements. It is of course more efficient to use one single INSERT statement, e.g.
INSERT INTO srchbigram(w, len, bigram) VALUES
(1,5,'hu'),(1,5,'um'),(1,5,'mo'),(1,5,'or'),(1,5,'rs'),
(2,11,'ba'),(2,11,'al'),(2,11,'ll'),(2,11,'ly'),(2,11,'yl'),(2,11,'lo'),
(2,11,'oc'),(2,11,'ch'),(2,11,'hl'),(2,11,'li'),(2,11,'in');
After changing the table structure and the INSERT, the approximate search only takes 230% of the time of an exact search (0.0076 versus 0.0033 seconds). However, a closer look at the statements in the search shows that those for the temporary table take a disproportionate amount of time: CREATE TEMPORARY TABLE takes 0.0025 seconds and DROP TEMPORARY TABLE takes 0.0003 seconds, i.e. 37% of the total time. There is a better way to get the search data into the SELECT: construct it as a union of rows of constants. See this in the improved SQL SELECT below.
SELECT title.title, ROUND((asrch.score * 100) / asrch.totlen) AS percent, tune.tuneid
FROM
(SELECT tw.titleid, SUM(tscore.c) AS score, 16 AS totlen
FROM
(SELECT srchbigram.w, srchbigram.len, titlebigram.twlid, COUNT(titlebigram.twlid) AS c
FROM (SELECT 1 AS w, 5 AS len, 'hu' AS bigram
UNION SELECT 1 AS w, 5 AS len, 'um' AS bigram
UNION SELECT 1 AS w, 5 AS len, 'mo' AS bigram
UNION SELECT 1 AS w, 5 AS len, 'or' AS bigram
UNION SELECT 1 AS w, 5 AS len, 'rs' AS bigram
UNION SELECT 2 AS w, 11 AS len, 'ba' AS bigram
UNION SELECT 2 AS w, 11 AS len, 'al' AS bigram
UNION SELECT 2 AS w, 11 AS len, 'll' AS bigram
UNION SELECT 2 AS w, 11 AS len, 'ly' AS bigram
UNION SELECT 2 AS w, 11 AS len, 'yl' AS bigram
UNION SELECT 2 AS w, 11 AS len, 'lo' AS bigram
UNION SELECT 2 AS w, 11 AS len, 'oc' AS bigram
UNION SELECT 2 AS w, 11 AS len, 'ch' AS bigram
UNION SELECT 2 AS w, 11 AS len, 'hl' AS bigram
UNION SELECT 2 AS w, 11 AS len, 'li' AS bigram
UNION SELECT 2 AS w, 11 AS len, 'in' AS bigram) AS srchbigram
INNER JOIN titlebigram ON titlebigram.bigram = srchbigram.bigram
GROUP BY srchbigram.w, srchbigram.len, titlebigram.twlid
HAVING c > srchbigram.len / 2
) AS tscore
INNER JOIN titleword AS tw ON tw.twlid = tscore.twlid
GROUP BY tw.titleid
HAVING score > totlen / 2
) AS asrch
INNER JOIN title ON title.titleid = asrch.titleid
INNER JOIN tune ON tune.tuneid = title.tuneid
ORDER BY percent DESC LIMIT 50;
I have also removed the ORDER BY clauses on the inner SELECTs; they were only there in case you wanted to try out the inner SELECTs separately. This, however, does not affect the speed significantly.
If you want to try this in e.g. phpMyAdmin you will probably need a statement SET CHARACTER SET 'latin1'; before the select, to eliminate character set problems. At least I needed that.
You can find code for building the SELECT statement in appendix 2.
Now my approximate search is fast! It only takes 0.0049 seconds (versus 0.0033 seconds for the exact search). So it only takes roughly 50% more time. This is so good that I definitely feel it's worth the extra disk space taken by the index tables. It also compares favourably to getting the whole text corpus into memory and then searching it using the Bitap algorithm (something which is also not practical for large text corpuses).
I currently cannot think of any other way to improve the algorithm, but suggestions are welcome.
The advantage of the above algorithm is that it is fairly simple, yet produces search results that are "good enough". As I said above the algorithm is also very fast.
However, one potential problem for long texts is that it could match words that are far apart in the text, e.g. the first and last words of a 500-word text, which is probably not what the user wants to find. One solution that comes to mind is segmenting long texts into overlapping chunks, but this might exclude some matches, and will also make the index tables bigger and the search algorithm more cumbersome. A better approach is perhaps to store the position of each word in the text and weight the scores for the words according to their proximity, as sketched below.
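Such a table could look like this (a hypothetical sketch; the textword and textid names are only for illustration, and the proximity weighting itself would need experimentation):

CREATE TABLE textword ( -- hypothetical: connects words to texts, with positions
  twlid INT, -- reference to the word list
  textid INT, -- reference to the text
  pos INT, -- word number within the text
  PRIMARY KEY(twlid, textid, pos)
);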
You could potentially expand this idea to perform approximate music searches. For ABC notation, I would remove any grace notes, spaces, etc. and also reduce the notes to one octave (i.e. G, G g and g' all map to g).
You might also want to search intervals instead of notes, which would make the search independent of transpositions. In this case you would convert e.g. "dgbgdgbg" to something like "3254325", where each digit is the number of scale steps up to the next note, counted modulo the octave. However, one problem with intervals is that just one differing note will yield two differing intervals ("dgggdgbg" will become "3004325"). In this example, the bigram match score is just 50% for the intervals, versus 86% for the notes. If you compensate for this lower match score by lowering the threshold for "striking a match", you will probably get many "false positives". However, you could probably first quickly get a rather big result set containing many false positives using a lower threshold, and then process this result set using e.g. the Bitap algorithm to eliminate false matches.
One problem is how you define a "word" for music searches. A bar might be too short (it can contain only one note or even just a rest) and the whole tune is too long (will give too many false positives). Even for Irish music, which has relatively many notes per bar, a typical bar of a jig would only have six notes, i.e. five intervals, i.e. short "words". So maybe you could use "words" of two bars and skip those that are too short. Maybe it's also better in this case to use trigrams instead of bigrams. Some experimentation is needed here to find a good music search algorithm using indexed n-grams.
My brother, who is a microbiologist, tells me that for protein and DNA matching, the most common algorithm is BLAST. After a quick look it seems like it uses trigrams instead of bigrams, but still a similar approach to approximate searching, though I don't know how fast it is compared to the algorithm I present here. An interesting feature is that it uses weighting of amino acids, so that some similarities are weighted higher than others. Something similar could perhaps be applied to music searches, where certain intervals (e.g. third) could be weighted higher.
Some version of the search algorithm presented here could also be used for finding similar texts or tunes in a database, starting from one text or tune.
Here I have described how I have implemented this idea for approximate string matches. It can perhaps be improved, and possibly has some problems I haven't thought of, so you're welcome to e-mail me your comments.
You're free to take the code and use it in your own application or web page, but if you do, please give credit to me with a link to this page.
Appendix 1. Here is the VB code for building the index tables. For the sake of brevity I have skipped the Dim statements for variables.
First, the tables titlewordlist and titleword.
function NoAccents(txt)
  ' Replaces e.g. "ä" by "a", etc. Also removes apostrophes.
  ' Full function not presented here for the sake of brevity.
end function

sub DoInsertTitleWord(titleid, word)
  ' Look up the word; insert it into titlewordlist if it is new.
  sql = "SELECT twlid FROM titlewordlist WHERE word='" & word & "';"
  set twl = db.execute(sql)
  if twl.eof then
    sql = "INSERT INTO titlewordlist (word) VALUES('" & word & "');"
    db.execute(sql)
    set twl = db.execute("SELECT MAX(twlid) AS twlid FROM titlewordlist;")
    twlid = twl("twlid")
    set twl = nothing
  else
    twlid = twl("twlid")
    set twl = nothing
  end if
  ' Connect the word to the title.
  sql = "INSERT IGNORE INTO titleword (twlid, titleid) VALUES (" & twlid & ", " & titleid & ");"
  db.execute(sql)
end sub

maxwordlen = 0
minwordlen = 4 ' ----- minimum word length
db.execute("TRUNCATE TABLE titlewordlist;")
db.execute("TRUNCATE TABLE titleword;")
set tune = db.execute("SELECT titleid, title FROM title ORDER BY titleid;")
do until tune.eof
  title = NoAccents(lcase(tune("title")))
  titleid = tune("titleid")
  tlen = len(title)
  i0 = 0 ' start position of the current word (0 = not inside a word)
  for i = 1 to tlen
    c = mid(title, i, 1)
    if c >= "a" and c <= "z" then
      if i0 = 0 then i0 = i ' start a new word
    else
      if i0 <> 0 then ' end of word
        wordlen = i - i0
        if wordlen > maxwordlen then maxwordlen = wordlen
        if wordlen >= minwordlen then
          word = mid(title, i0, wordlen)
          DoInsertTitleWord titleid, word
        end if
        i0 = 0
      end if
    end if
  next
  if i0 <> 0 then ' the title ended inside a word
    wordlen = i - i0
    if wordlen > maxwordlen then maxwordlen = wordlen
    if wordlen >= minwordlen then
      word = mid(title, i0, wordlen)
      DoInsertTitleWord titleid, word
    end if
  end if
  tune.movenext
loop
Then the table titlebigram.
db.execute("TRUNCATE TABLE titlebigram;") set words = db.execute("SELECT twlid, word FROM titlewordlist;") do until words.eof twlid = words("twlid") word = words("word") wlen = len(word) for i = 1 to wlen - 1 sql = "INSERT IGNORE INTO titlebigram (bigram, twlid) " sql = sql & " VALUES('" & mid(word, i, 2) & "', " & twlid & ");" db.execute(sql) next words.movenext loop
Appendix 2. Here is the VB code for building the SELECT statement for the search. For the sake of brevity I have skipped the Dim statements for variables.
I have also skipped the code for processing the search string into words, which is similar to the code in appendix 1.
sql = "SELECT title.title, ROUND((asrch.score * 100) / asrch.totlen) AS percent, tune.tuneid " sql = sql & " FROM " sql = sql & " (SELECT tw.titleid, SUM(tscore.c) AS score, " & totalwordlen - wordcount & " AS totlen " sql = sql & " FROM " sql = sql & " (SELECT srchbigram.w, srchbigram.len, titlebigram.twlid, COUNT(titlebigram.twlid) as c " sql = sql & " FROM ( " isfirst = 1 for w = 1 to wordcount ' For every word in search string wlen = len(word(w)) for i = 1 to wlen - 1 ' for each bigram bigram = mid(word(w), i, 2) if isfirst then isfirst = 0 else sql = sql & "UNION " sql = sql & " SELECT " & w & " AS w, " & wlen - 1 & " AS len, '" & bigram & "' AS bigram " next next sql = sql & " ) AS srchbigram " sql = sql & " INNER JOIN titlebigram ON titlebigram.bigram = srchbigram.bigram " sql = sql & " GROUP BY srchbigram.w, srchbigram.len, titlebigram.twlid " sql = sql & " HAVING c > srchbigram.len / 2 " sql = sql & " ) AS tscore " sql = sql & " INNER JOIN titleword AS tw ON tw.twlid = tscore.twlid " sql = sql & " GROUP BY tw.titleid " sql = sql & " HAVING score > totlen / 2 " sql = sql & " ) AS asrch " sql = sql & " INNER JOIN title ON title.id = asrch.titleid " sql = sql & " INNER JOIN tune ON tune.tuneid = title.tuneid " sql = sql & " ORDER BY percent DESC LIMIT 50; "
This version of this page published 22 April 2017 by Henrik Norbeck