Using a sentence transformer for Vietnamese province resolution
I recently stumbled upon an interesting problem: resolving user input to the correct Vietnamese address. A user types in an address, and the system needs to identify the province, district, and ward and map each of them to a unique code. These codes are then used for downstream processing. The task can be viewed as an entity resolution problem, where free-form user input is normalized to a single canonical value.
This can be a challenging problem because users may type a name using an alias, without proper accents, or with typos. Below are some examples for Vietnamese provinces.
The existing solution uses fuzzy search with some hardcoded phrases. It works well enough, but there is clearly room for improvement. Inspired by this blog, I want to approach the problem in a different way.
Address resolution as a retrieval problem
This can be seen as a dense retrieval problem. The idea is to encode the list of entities (provinces, cities, districts, etc.) into embeddings. When the user provides input, it is also encoded into an embedding. Address resolution then becomes a matter of finding the entity embeddings closest to the query embedding (which represents the user input) in the vector space.
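To make the encode-then-search flow concrete, here is a minimal sketch using only the Python standard library. It does not use a real sentence transformer: the `embed` function below is a toy encoder that hashes character trigrams into a fixed-size vector, standing in for whatever model you would actually plug in. The entity list, the placeholder codes, and the names `fold`, `embed`, `INDEX`, and `resolve` are all assumptions made for illustration.

```python
import hashlib
import math
import unicodedata

DIM = 1024  # size of the toy embedding; an arbitrary choice for this sketch

# Hypothetical entity list: (canonical name, placeholder code).
# The codes are made up and are NOT official administrative codes.
ENTITIES = [
    ("Hà Nội", "HN"),
    ("Hồ Chí Minh", "HCM"),
    ("Đà Nẵng", "DN"),
    ("Cần Thơ", "CT"),
    ("Hải Phòng", "HP"),
]


def fold(text):
    """Lowercase and drop diacritics so 'Hà Nội' and 'ha noi' look alike."""
    text = text.lower().replace("đ", "d")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def embed(text, n=3):
    """Toy encoder (stand-in for a sentence transformer).

    Hashes character n-grams into DIM buckets and L2-normalizes the result,
    so a dot product between two embeddings is their cosine similarity.
    """
    padded = f" {fold(text)} "
    vec = [0.0] * DIM
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n].encode("utf-8")
        bucket = int.from_bytes(hashlib.md5(gram).digest()[:4], "big") % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


# Index step: encode every entity once, ahead of time.
INDEX = [(name, code, embed(name)) for name, code in ENTITIES]


def resolve(user_input):
    """Return (name, code, score) for the entity nearest to the query embedding."""
    query = embed(user_input)
    scored = [
        (sum(q * x for q, x in zip(query, vec)), name, code)
        for name, code, vec in INDEX
    ]
    score, name, code = max(scored)
    return name, code, score


if __name__ == "__main__":
    for text in ["ha noi", "Da Nag", "ho chi minh"]:
        print(text, "->", resolve(text))
```

With this toy encoder, unaccented input and small typos still land near the right entity because they share most of their trigrams. It cannot resolve aliases that share no characters with the canonical name, which is where a trained embedding model would have to do the work. In practice you would probably also want a minimum score threshold so that unrelated input is rejected instead of being mapped to the nearest province.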