Sunday, February 22, 2026

Comparing Hebrew (or other non-English) strings in Oracle

Had 2 list of cities in Hebrew. Used utl_Match.jaro_winker_similarity to find the similarity.


I have 2 tables: Bagrut with column LOC and Israel_all_loc with columns SETTLEMENT_CODE, SETTLEMENT_NAME. There are small differences in some of the names such as:

תל אביב-יפו vs. תל אביב - יפו

or

כנרת )קבוצה( vs. כנרת (קבוצה)

or

נהרייה vs. נהריה 

The list are few hundreds each, so did a cartesian join with a similarity ranking:

select * From
(
select SETTLEMENT_CODE, SETTLEMENT_NAME, LOC,
utl_MATCH.jaro_winkler_similarity(SETTLEMENT_NAME, LOC) similarity
from
(select distinct SETTLEMENT_CODE, SETTLEMENT_NAME from ISRAEL_ALL_LOC) a,
(select distinct LOC from BAGRUT) B
)
where similarity>95

The 95 similarity was selected after some manual examinations.

Next checked each LOC with more than 1 line and selected the max(similarity). 

Still had to delete 4 lines manually.


No comments:

Post a Comment