our "To Do" list to change this as it is really problematic for
non-English text.
The other thing: we are also planning to allow searching via
transliterations, though this is a bit problematic as there may be more
than one transliteration scheme. Fuzzy search (Levenshtein distance)
can help here, but it can lead to surprising results.
Adding the former is not too difficult. I haven't looked into the latter.
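
To make the fuzzy-search trade-off concrete, here is a minimal sketch of
Levenshtein distance in plain Java. It is only an illustration and not
how JSword or Lucene implement fuzzy matching; the class and method
names are made up for this example.

```java
/** Illustrative only: hypothetical helper, not part of JSword or Lucene. */
public final class FuzzyMatchSketch {
    private FuzzyMatchSketch() {}

    /**
     * Edit distance between two strings: the minimum number of
     * single-character insertions, deletions, and substitutions
     * needed to turn a into b. Compares UTF-16 chars, keeping only
     * two rows of the dynamic-programming table in memory.
     */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete all i chars of a to reach empty b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Two spellings of one word under different schemes, such as "shalom" and
"shalowm", come out at distance 1, but so do unrelated strings such as
"bar" and "bat". That is where the surprising results come from.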
Basically, the criterion is that the search request and the index need to
be created by normalizing the data in the same fashion. We'll be using
ICU4J to do the normalization. Robert Muir, a contributor to Lucene,
has come up with an excellent mechanism that works well for Unicode data
independent of the document's language. It has not been released yet, but
it looks very promising and I'll be experimenting with it before too
long and may, with his permission, use it before it is released.
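
As a rough sketch of the "same normalization on both sides" idea, the
following uses the JDK's java.text.Normalizer as a stand-in for ICU4J
(this is an assumption made to keep the example self-contained; it is
not Robert Muir's mechanism). The class and method names are
hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Illustrative only: hypothetical helper, not part of JSword. */
public final class SearchNormalizerSketch {
    private SearchNormalizerSketch() {}

    /**
     * Decomposes the text (NFD), drops every combining mark
     * (Unicode category M: Greek accents, Hebrew vowel points,
     * cantillation marks), and lower-cases the result.
     */
    public static String normalize(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    /**
     * A term matches when the indexed form and the query form are
     * equal after both have gone through the same normalize() call.
     */
    public static boolean matches(String indexedTerm, String queryTerm) {
        return normalize(indexedTerm).equals(normalize(queryTerm));
    }
}
```

The point of the sketch is only that normalize() is called once when
building the index and again on each search request; which marks to
strip, and whether to do it per language, is a separate decision.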
Glad to work together on this.
In Him,
DM
On 11/23/2009 01:04 PM, Troy A. Griffitts wrote:
> David,
>
> We have done the work to enable the stripping of vowels, accents, and
> such for Greek, e.g.,
>
> http://crosswire.org/study/wordsearchresults.jsp?mod=PHI_CHR&searchTerm=μακαρ*
>
> We can do the same thing for Hebrew-- we have the filters to turn
> cantillation, accents, and such on and off. We just need to register
> these filters in the search phase before the search.
>
> I don't know how JSword does this. Maybe DM can comment.
>
>
> Troy
>
>
>
> Tyndale STEP Project wrote:
>
>> Chris, (or maybe Troy) does JSword have a Hebrew search facility?
>> The web version has a relatively flexible search facility for NASB
>> (see http://crosswire.org/study/powersearch.jsp - incl Reg expressions)
>> but I don't know how to change this to Hebrew.
>> I suspect you can only search for Hebrew exactly as it occurs.
>>
>> What you said about stripping out vowels is exactly right.
>> Using this transliteration we should be able to search the Hebrew text
>> easily,
>> (ie we can use it to search the real Hebrew as well as the
>> transliterated Hebrew).
>>
>> Colin, I've had a go at making the Hebrew transliteration look more
>> conventional.
>> It occurred to me that there is no reason why we shouldn't display a
>> dot under
>> H for /chet/ and T for /tet/ and tell people to ignore accents and dots
>> when typing it in,
>> because the upper case H and T will tell the program what to search for.
>> (When displaying, the upper case is made smaller so it looks like a
>> lower case)
>> We could even use upper case S with an acute accent (displayed smaller)
>> for /shin/.
>>
>> Do you think this will help those who are used to normal transliterations?
>> Or do you think it will confuse those who just want to 'read' it?
>> I've put some examples at the bottom of the page at
>> http://www.tyndalearchive.com/STEP/TEST/testtrans.htm
>>
>> David IB
>>
>>
>> At 20:20 20/11/2009, Tyndale STEP Project wrote:
>>
>>> Hi Guys
>>>
>>> Thanks for including me in the discussions... I must admit it sounds
>>> rather complicated and given my limited (almost no) Hebrew, I'm a bit
>>> drowned in the whole lot...
>>>
>>> Just a thought for the searching, we can store one form of the word
>>> for display, and then for example, automatically remove all the vowels
>>> for searching it (or have two forms for searching)...
>>>
>>> Ie. we can build our own index-lookup tables for those words. For
>>> example, looking up grace could be stored in our index table as
>>> grace | grc
>>> grace | grace
>>>
>>> etc.
>>>
>>> In some databases, you'd use function-based indexes, we can use a
>>> feature of Java DB called generated columns (ie. you insert/update
>>> column A and column B gets populated as a result too, with a different
>>> form of the word/integer/value)
>>>
>>> That will only work as long as we can identify vowels as vowels, which
>>> didn't sound too complicated from your posts...
>>>
>>> Chris
>>>
>>>
>>> --
>>> Posted By Tyndale STEP Project to Tyndale STEP - Programming
>>> <http://tyndalestep-prog.blogspot.com/2009/11/hebrew-transliterations.html>
>>> on 11/20/2009 12:20:00 PM
>>>
>>
>> --
>> Posted By Tyndale STEP Project to Tyndale STEP - Programming
>> <http://tyndalestep-prog.blogspot.com/2009/11/re-tyndale-step-programming-hebrew_23.html>
>> on 11/23/2009 09:34:00 AM
>>
>