Monday, 2 November 2009

Re: [Tyndale STEP - Programming] Re: [Tyndale STEP - Programming] Unicode in data...

Our engine has filters that allow source texts to be stored in a number of different encodings, but this is mostly for historical reasons. UTF-8 has largely won out: it is what our C++ engine uses internally and what we prefer our source documents to be encoded in.

There are a number of issues to consider when deciding on a persistence layer. The predominant practical use cases in view are _presentation_ and _searching_. For example, presentation might lead one to store a fully annotated, accented, variable-case, NFK normalized, UTF-8 encoded document, which indeed is our preference here at CrossWire. But while this makes a rich display straightforward, it also requires us to do the most filtering to implement a robust search feature. Imagine that a user wants to search the body (not footnotes or annotations) of Greek documents ranging from inscriptions to fully accented contemporary GNT editions, and types hAMARTIA in a search box. The query might be typed in b-greek transliteration as in my example, as unaccented lowercase Unicode Greek, as properly accented Greek that is not NFK normalized, or what have you. It is not as simple as: SELECT key FROM verses WHERE body LIKE '%hAMARTIA%';
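
To make the kind of filtering involved concrete, here is a minimal sketch in Python rather than our C++ engine. It is not how our engine does it; the function names search_key and matches are made up for illustration. It assumes the query is already Unicode Greek, so mapping a b-greek query such as hAMARTIA to Greek letters would be a separate step that is not shown. Both the query and the stored text are reduced to the same accent-free, case-folded key before comparing:

```python
import unicodedata


def search_key(text: str) -> str:
    """Reduce text to an accent-free, case-insensitive key (hypothetical helper)."""
    # Decompose, so accents and breathings become separate combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() lowercases and also maps final sigma to ordinary sigma.
    return unicodedata.normalize("NFC", stripped.casefold())


def matches(query: str, body: str) -> bool:
    """True if the query occurs in the body once both are reduced to keys."""
    return search_key(query) in search_key(body)


if __name__ == "__main__":
    # Accented, unaccented and uppercase spellings all find the same word.
    for query in ("ἁμαρτία", "αμαρτια", "ΑΜΑΡΤΙΑ"):
        print(query, matches(query, "ἡ ἁμαρτία"))
```

In practice one would probably compute such a key once and store it next to the display text, so that the rich document is kept for presentation and the stripped form is what gets searched.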

Hope this report of our pain is helpful.

-Troy.


Tyndale STEP Project <TyndaleSTEP@gmail.com> wrote:

>Hmmm. Good point. We need to check. I believe they just store them in
>the modules themselves. But it will impact the creation of the database
>schema, as presumably, if we are referencing scripture, we will want to
>be able to quote the text and therefore will also need unicode.
>
>Java tends to abstract away the need to worry about encoding details,
>but worth checking on the crosswire community...
>
>Chris
>
>
>
>2009/11/2 Tyndale STEP Project <TyndaleSTEP@gmail.com>
>Does anyone know how hard it will be to include Unicode for Greek &
>Hebrew?
>Do we know what JSword does for Greek & Hebrew (I assume they use
>Unicode) - how do they store it?
>
>I think I remember Troy telling me that Unicode is easy to store as a
>two-byte character with one bit set as a flag,
>but I can't remember the details.
>
>David IB
>
>--
>Posted By Tyndale STEP Project to Tyndale STEP - Programming on
>11/02/2009 05:37:00 AM
>
>
>
>--
>Posted By Tyndale STEP Project to Tyndale STEP - Programming on
>11/02/2009 05:47:00 AM
