More on UTF-8 and transliterations

Tuesday, 3 November 2009

A further complication to be aware of is that there are multiple ways of
representing certain characters: some are genuine duplicates and (as I've
now found) some are simply wrong. This will affect anything we store, and
we may need to check anything we import too.

For instance, the official way to represent alpha + oxia (acute) accent
is as 03B1 (alpha) + 0301 (combining acute accent). However, it may show
up as 03B1 + 0384, 0384 being the tonos accent used in modern Greek, which
looks similar but is a spacing character rather than a combining one, so
it does not attach to the alpha. Or as 03AC, which is alpha + tonos
combined into a single character. (A so-called "precomposed" form; Unicode
provides these for the most usual letter-plus-accent combinations, and
treats 03AC as canonically equivalent to 03B1 + 0301.)

Similarly in Hebrew, sin plus dagesh is either 05E9 (shin/sin letter) +
05BC (dagesh) + 05C2 (sin dot) or FB2D, a single presentation-form
character.
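
As a rough sketch of how these variants might be checked on import, Python's standard unicodedata module can normalize the canonically equivalent spellings to a single form. The helper names canonical and has_spacing_tonos are made up for illustration, the choice of NFC as the storage form is an assumption rather than anything we have decided, and 1F71 (alpha with oxia, precomposed) is an extra variant not mentioned above.

```python
import unicodedata

GREEK_TONOS = "\u0384"  # spacing accent, not a combining mark


def canonical(text, form="NFC"):
    """Return text in one Unicode normalization form (NFC assumed here)."""
    return unicodedata.normalize(form, text)


def has_spacing_tonos(text):
    """Flag text that uses the spacing tonos, which normalization won't fix."""
    return GREEK_TONOS in text


# Greek: decomposed, precomposed-with-tonos and precomposed-with-oxia
# spellings of accented alpha all normalize to the same string.
greek = ["\u03b1\u0301", "\u03ac", "\u1f71"]
print({canonical(s) for s in greek} == {"\u03ac"})

# Hebrew: the presentation form FB2D and either order of the two marks
# normalize to shin + dagesh (05BC) + sin dot (05C2).
hebrew = ["\ufb2d", "\u05e9\u05bc\u05c2", "\u05e9\u05c2\u05bc"]
print({canonical(s) for s in hebrew} == {"\u05e9\u05bc\u05c2"})

# alpha + spacing tonos passes through unchanged, so it has to be detected.
suspect = "\u03b1\u0384"
print(canonical(suspect) == suspect, has_spacing_tonos(suspect))
```

Note that the 03B1 + 0384 sequence survives normalization untouched, so it would need to be found and repaired as a separate step.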

A further related issue is that of transliterations. In dictionary
articles etc., we can supply Greek and Hebrew, but many of our readers
won't be able to read them, particularly Hebrew, so a transliteration
would be helpful. Either we always give a transliteration, or we provide a
couple of switches to say "Greek in Greek/Greek in transliteration" and
ditto for Hebrew. If we're storing two variant forms anyway, then we could
add a third, a "clean", unaccented form that is there just for searching.
Would this get round the problems Troy suggested?
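
A minimal sketch of how such a "clean" search form might be derived, again in Python and with a made-up function name, search_key: decompose the text, drop the combining marks (accents, breathings, vowel points), and fold case. Whether this is enough for the problems Troy raised is an open question; it is only meant to show the idea.

```python
import unicodedata


def search_key(text):
    """Build an unaccented, case-folded key for matching (illustrative only)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop nonspacing marks: Greek accents and breathings, Hebrew points.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # casefold() also maps final sigma to ordinary sigma.
    return stripped.casefold()


print(search_key("ἀγάπη"))  # αγαπη
```

The key would be stored alongside the displayed forms and used only for matching; the same function would be applied to whatever the user types into the search box.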

Colin

Tyndale STEP - Programming
