Had 2 list of cities in Hebrew. Used utl_Match.jaro_winker_similarity to find the similarity.
I have 2 tables: Bagrut with column LOC and Israel_all_loc with columns SETTLEMENT_CODE, SETTLEMENT_NAME. There are small differences in some of the names such as:
תל אביב-יפו vs. תל אביב - יפו
or
כנרת )קבוצה( vs. כנרת (קבוצה)
or
נהרייה vs. נהריה
The list are few hundreds each, so did a cartesian join with a similarity ranking:
The 95 similarity was selected after some manual examinations.
Next checked each LOC with more than 1 line and selected the max(similarity).
Still had to delete 4 lines manually.
No comments:
Post a Comment