Two edge cases in PostgreSQL full-text search
We widely use PostgreSQL full-text search in our projects. It is fast, reliable, and doesn't add any additional technical complexity. But sometimes it may not work as you expect it too.
Usually we add a ts_vector
column right in our data tables. Let's see how it has it is done.
title | tsv |
---|---|
Handyman | ‘handyman’:1 |
Heating & Cooling | ‘cool’:2 ‘heat’:1 |
Painting | ‘paint’:1 |
Roofing | ‘roof’:1 |
Column tsv
is generated with to_tsvector
function from a title
column with a trigger function.
CREATE FUNCTION skills_tsv_update_trigger() RETURNS trigger AS $$
begin
new.tsv :=
to_tsvector('pg_catalog.english', COALESCE(new.title, ''))
return new;
end
$$ LANGUAGE plpgsql;
Then we can query this table using to_tsquery
fuction.
SELECT * FROM skills
WHERE
tsv @@ to_tsquery('pg_catalog.english', 'roof')
-- 1 record found
A record with Roofing
has been found, and this makes sense. But what should be found for handy
query? You can imagine that it will be a Handyman
skill and you are wrong. Before explaining why this happens, let's try another query.
SELECT * FROM skills
WHERE
tsv @@ to_tsquery('pg_catalog.english', 'roofi:*')
-- no records found
No records were found, even if we have a record with Roofing
title in our table.
The to_tsvector
function internally calls a parser which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries is consulted, where the list can vary depending on the token type. The first dictionary that recognizes the token emits one or more normalized lexemes to represent the token. For example, roofing
became roof
. Some words are recognized as stop words, like &
in our example. It is possible to have many different configurations in the same database, and predefined configurations are available for various languages. In our example we used the default configuration english for the English language.
to_tsquery
function parses user input and converts it to tsquery
data type. It uses the same dictionaries as to_tsvector
function. Let's try to understand what happens in two queries that were listed above.
The result of to_tsquery('pg_catalog.english', 'handy')
will be handi
because of pluralization rules for words ending on y
. The query listed below doesn't return any records because 'handyman':1
vector does not overlap with handi
query.
SELECT * FROM skills
WHERE
tsv @@ to_tsquery('pg_catalog.english', 'handy')
-- no records found
How we can cover this situation? As you remember you can change a configuration that is used to parse user input. Let's try a simple
configuration.
select to_tsquery('pg_catalog.simple', 'handy')
-- 'handy'
Looks like this is what we need, let's add this to our query.
SELECT * FROM skills
WHERE
tsv @@ to_tsquery('pg_catalog.english', 'handy')
OR
tsv @@ to_tsquery('pg_catalog.simple', 'handy')
-- 1 record found
A problem with Roofing
and roof
can be solved in the same way. Let's change the function that is used to populate tsv
column.
CREATE FUNCTION skills_tsv_update_trigger() RETURNS trigger AS $$
begin
new.tsv :=
to_tsvector('pg_catalog.english', COALESCE(new.title, '')) ||
to_tsvector('pg_catalog.simple', COALESCE(new.title, ''))
return new;
end
$$ LANGUAGE plpgsql;
title | tsv |
---|---|
Handyman | ‘handyman’:1,2 |
Heating & Cooling | ‘cool’:2 ‘cooling’:4 ‘heat’:1 ‘heating’:3 |
Painting | ‘paint’:1 ‘painting’:2 |
Roofing | ‘roof’:1 ‘roofing’:2 |
Now vector has more data and query below works as expected.
SELECT * FROM skills
WHERE
tsv @@ to_tsquery('pg_catalog.english', 'roofi:*')
-- 1 record found