Tyndale STEP - Programming: Re: [Tyndale STEP - Programming] Unicode texts in database

Tuesday, 3 November 2009

Re: [Tyndale STEP - Programming] Unicode texts in database

To a certain extent, if we're going to search biblical texts, these will be provided by the JSword API and modules, so I'd be very surprised if it wasn't designed to cope with Unicode, etc. But it's worth asking. I'm happy to post a message on the forum, but I feel a little out of my depth on the whole encoding thing...
Let me know if you want me to post...

Chris

2009/11/3 Tyndale STEP Project <TyndaleSTEP@gmail.com>

This problem of searching pointed Hebrew or accented Greek sounds like a big issue. How does JSword cope with it?
The strategy which springs to mind is having two text fields - one with the display text and one with the search text.
The display text would be fully Unicode with accents or pointing,
and the search field would be transliterated into ASCII, with several elements separated by an unusual character, e.g. "=",
by which the word could be expanded into an array.

The elements of the array would be
* word position in the verse/section
* word with pointing etc
* word without pointing etc
* Strongs number
* morphology (where available)

So Gen.1.1 would start out:

1=B;rExWijt=brxWjt=07225= 2=BArAx=brx=01254=8804 [using the Tyndale Unicode Hebrew Keyboard for the ascii]

If they want all the instances of bara (Strongs 01254) in Qal Perfect (code 8804), they could search for "01254=8804".
If they wanted every instance where the word bereshit occurred, they could search for the pointed form or the unpointed form.
When the array is expanded, the program gets the position of the word, so the display can highlight that word.
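
A minimal Python sketch of how such a search field might be expanded and queried, assuming words are separated by spaces and that neither "=" nor whitespace occurs inside a transliterated word. The names Token, parse_search_field and find_positions are hypothetical, made up for illustration; this is not how Sword or JSword does it.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    position: int      # word position in the verse/section
    pointed: str       # transliterated word with pointing
    unpointed: str     # transliterated word without pointing
    strongs: str       # Strongs number, kept as a zero-padded string
    morphology: str    # morphology code, empty when not available


def parse_search_field(field: str) -> List[Token]:
    """Expand a '='-delimited search field into one Token per word."""
    tokens = []
    for chunk in field.split():
        parts = chunk.split("=")
        parts += [""] * (5 - len(parts))  # pad missing trailing elements
        position, pointed, unpointed, strongs, morphology = parts[:5]
        tokens.append(Token(int(position), pointed, unpointed, strongs, morphology))
    return tokens


def find_positions(field: str,
                   strongs: Optional[str] = None,
                   morphology: Optional[str] = None,
                   form: Optional[str] = None) -> List[int]:
    """Return positions of words matching every criterion that was given."""
    hits = []
    for tok in parse_search_field(field):
        if strongs is not None and tok.strongs != strongs:
            continue
        if morphology is not None and tok.morphology != morphology:
            continue
        if form is not None and form not in (tok.pointed, tok.unpointed):
            continue
        hits.append(tok.position)  # the display layer can highlight this word
    return hits


# The start of Gen.1.1 as given above.
GEN_1_1 = "1=B;rExWijt=brxWjt=07225= 2=BArAx=brx=01254=8804"
```

With this, find_positions(GEN_1_1, strongs="01254", morphology="8804") picks out word 2, and find_positions(GEN_1_1, form="brxWjt") picks out word 1. Comparing parsed elements rather than matching the raw substring "01254=8804" avoids accidental hits that straddle two elements.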

Anyway, this is just the ramblings of someone who hasn't thought about it much.
Anyone know how Sword does this searching?

David IB

At 21:15 02/11/2009, Tyndale STEP Project wrote:

We have filters in our engine which allow source texts to be stored in a number of different encodings, but this is mostly for historical reasons. Encoding text as UTF-8 has mostly won out. This is what our C++ engine uses internally and what we prefer our source documents to be encoded as.

There are a number of issues to consider when deciding on a persistence layer. The predominant practical use cases in view are _presentation_ and _searching_. For example, presentation might lead one to store a fully annotated, accented, variable-case, NFK normalized, UTF-8 encoded document, which indeed is our preference here at CrossWire. But while this makes a rich display straightforward, it also requires us to do the most filtering to implement a robust search feature. Imagine a user who wants to search the body (not footnotes or annotations) of Greek documents ranging from inscriptions to fully accented contemporary GNT editions, and who types hAMARTIA in a search box. The query might be entered in b-greek encoding as in my example, as unaccented lowercase Unicode Greek, as properly accented text that is not NFK normalized, or what have you. It is not as simple as: SELECT key FROM verses WHERE body LIKE '%hAMARTIA%';
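
To illustrate the kind of filtering involved, here is a rough Python sketch that folds Unicode Greek to an unaccented, caseless search key using the standard unicodedata module. The function name search_key is hypothetical, and this is only one possible approach, not a description of the SWORD engine's filters. It does not handle b-greek transliteration such as hAMARTIA, which would need a separate mapping step first.

```python
import unicodedata


def search_key(text: str) -> str:
    """Fold text to an unaccented, caseless form for comparison.

    Assumption: the same folding is applied both to the stored body
    (for example in a separate search column) and to the user's query.
    """
    # Decompose so accents and breathings become separate combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (accents, breathings, iota subscript).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() removes case distinctions, including final sigma.
    return unicodedata.normalize("NFC", stripped.casefold())


def body_contains(body: str, query: str) -> bool:
    """True when the folded query occurs in the folded body."""
    return search_key(query) in search_key(body)
```

Under these assumptions a fully accented ἁμαρτία, an uppercase ΑΜΑΡΤΙΑ and an unaccented αμαρτια all fold to the same key, whichever normalization form the input arrived in.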

Hope this report of our pain is helpful.

-Troy.

--
Posted By Tyndale STEP Project to Tyndale STEP - Programming on 11/03/2009 11:09:00 AM
