Friday, 20 November 2009

Re: [Tyndale STEP - Programming] Re: [Tyndale STEP - Programming] Re: Hebrew tran...

Tyndale STEP Project wrote:
> Colin, I think you may have solved all the issues - great!

I hope so, although I've now discovered a further thing we need to
consider, which is whether everything still works correctly when it is
cut and pasted: both into other documents, and from our text into our
search box.

> And I hang my head in shame about the composite shewa.
> If you ever want to put me in my place, you can remind me about that.

... and you can reciprocate when you see what I have to admit below
about dictionary ordering!

> The big problem I had was the sin/shin one, and your ss/sh solution is
> perfect.
> My main problem with the 'standard' systems is that they simply can't be
> typed.
>
> All word processors can underline but have you tried putting a dot under
> narrow and wide letters in Word?
> (I have a macro for it, but it isn't perfect!).
> Displaying dots under letters on our webpages won't be easy either.
> There aren't standard glyphs available.

I think we may have been talking at cross-purposes a bit here, because I
now realise there are two different ways to underline things.

What I was assuming was that we would be using an accented letter for
these symbols. This is officially what you're meant to do.
Unfortunately, there aren't the symbols we need in the basic Latin-1
code pages, so we'd have to go to full Unicode, specifically charts like
http://www.unicode.org/charts/PDF/U1E00.pdf
If we do this, it doesn't matter whether we use dot or underline, since
they're equivalent. (Well, assuming they're all there, of course.)
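
One way to check the "assuming they're all there" caveat is to ask a Unicode database directly. The sketch below is only an illustration: the helper name precomposed and the choice of the letters h, t and s are assumptions, and the answer depends on the Unicode version bundled with the Python in use.

```python
import unicodedata

DOT_BELOW = "\u0323"   # COMBINING DOT BELOW
LINE_BELOW = "\u0331"  # COMBINING MACRON BELOW

def precomposed(base, mark):
    """Return the single precomposed character for base+mark, or None."""
    composed = unicodedata.normalize("NFC", base + mark)
    return composed if len(composed) == 1 else None

# Report, for each letter we care about, whether a one-codepoint form exists.
for letter in "hts":
    for label, mark in (("dot", DOT_BELOW), ("line", LINE_BELOW)):
        ch = precomposed(letter, mark)
        name = unicodedata.name(ch) if ch else "no precomposed form"
        print(f"{letter} + {label} below: {name}")
```

Any letter reported as having no precomposed form would need a combining character, which is where the support worries come in.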

What I think you're suggesting is using underlined text instead of an
underbar accent, which would be <u>h</u> in HTML.

This is simpler to type, but has a couple of problems: firstly if you
have two successive letters underlined, there will be a continuous bar
under both, rather than two separate bars, which will look odd. More
importantly though we need to check whether it will be preserved if the
text is cut and pasted into something else.

It would also not be too difficult to construct Word (etc.) macros to
automatically generate the Unicode symbols from the right input. This is
trivial using the Autocorrect options in OpenOffice - I imagine a
similar feature exists in Word.

However, the problem with Unicode is that if we want to distinguish the
"waw-yod" vowels as well, then the precomposed forms we'd like don't exist,
unless we build them from combining characters (which are not
well-enough supported to be safe in my opinion - can explain further if
needed).

> BTW I didn't mean to imply dropping the distinction between aleph and ayin.
> The /'_ir_/ example was a typo.
>
> A lot of systems allow the use of "(" or ")" instead of curly single
> quotes.
> I'm not sure we can allow the dropping of these, or the amalgamation into '
> because it will make lexical lookups very difficult.

Agreed - as I said I didn't think you were proposing doing this, but
wanted to point out the problems just in case you were.

> So, the alphabet would be:
>
> )-` b-v g d h w z _h_ _t_ y-î k l m n s (-´ p
> _s_ q r ss/sh t/th
>
> The display alphabet could use curly single quotes and we could allow
> users to type ) and ( instead.

(Hopefully ' and ` as well, since these are also standard. We could
transform one into the other, since we'll need to do a first pass anyway
to strip accents off vowels and convert the underline/underdot form into
capitals in the event that someone cuts and pastes a word from our text
into the search box.)
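
A rough sketch of what such a first pass might look like, not a settled design: decompose each character, capitalise any letter carrying a dot or line below, drop every other accent, and map quote characters to the typed forms. The name normalise_query and the pairing in QUOTE_MAP are illustrative assumptions; which quote stands for aleph and which for ayin would have to follow whatever alphabet we settle on.

```python
import unicodedata

# Marks that mean "this letter is underlined/underdotted".
UNDER_MARKS = {"\u0323", "\u0331"}  # combining dot below, combining macron below

# ASSUMED pairing of quote characters to the typed forms; adjust as needed.
QUOTE_MAP = {"\u2019": ")", "'": ")", "\u2018": "(", "`": "("}

def normalise_query(text):
    """Reduce pasted display text to the plain form a user would type."""
    out = []
    # NFD splits each accented letter into base letter + combining marks.
    for ch in unicodedata.normalize("NFD", text):
        if ch in UNDER_MARKS:
            # Record the under-mark by capitalising the letter it belongs to.
            if out:
                out[-1] = out[-1].upper()
        elif unicodedata.combining(ch):
            continue  # drop circumflex, diaeresis, overbar and so on
        else:
            out.append(QUOTE_MAP.get(ch, ch))
    return "".join(out)
```

Because decomposition puts marks below a letter ahead of marks above it, a vowel that is both underlined and accented still comes out as a bare capital.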

I still want to put a vote in for k-kh and p-ph, as (regardless of what
modern Hebrew does) this is the way that all the textbooks tell you to
pronounce them, and I've only ever heard people talk about "nephesh",
not "nepesh", etc.

> But I'm not sure about the vowels, especially e.
> I'd like to keep simple "e" for shewa, so the vowels would be:
>
> a â e ? ê êy i î o ô u û
>
> I take your point about ë for tsere - it causes unnecessary confusion
> though I'm not sure what to do about segol.
> I'd like to use "e" for shewa for simplicity of transliteration, so we'd
> need a different accent for segol.
> Suggestions?
> Also, any suggestions for qames hatuph?

Shewa is a difficult one, and I don't think I've seen something I'd
regard as a good ASCII solution. The system I've seen which uses e for
it also uses e for segol (which I think is a mistake, since they ought
to be distinguished).

Two approaches come to mind, taking into account both issues:

1. a,e,i,o,u are used for the grammatically short vowels (including
segol and qames hatuph)

â... are used for the grammatically long vowels, including both holem
and waw-holem, which are essentially the same (and interchangeable in
many cases) and would be distinguished by the underlining.

Shewa gets a different symbol, as it isn't a vowel at all. My suggested
choice would be @, which looks a bit like a vowel, and is not _that_
dissimilar from the upside-down e (ə) that is used for the sound in IPA.

(We could display as superscripted e as is common but this is risky for
the same cut-and-paste reasons as noted before.)

2. If we're set on e for shewa, then e with overbar is the natural
choice for segol as a mid-length vowel. Having broken things a bit in one
respect, we'd probably then also go for o for qames hatuph, o with
overbar for holem, o with circumflex for waw-holem.

1 feels more logical to me, but 2 would work.

One unrelated issue: we've not yet accounted for tsere + yod, which I
assume gets ey (possibly plus overbar) rather than êy.

> I didn't know about diaeresis for non-diphthong (eg hammaïm).
> Do you know if this works for every case? - ie does this problem only
> happen with "i"?

I believe that the only ways you can get two vowels in a row without an
intervening consonant are the -aim dual ending and the furtive patah.
The latter would be worth marking as ä if we can, as words like noa_h_
(Noah) could otherwise be read with a diphthong. (ea is similarly
confusing, but I don't know whether it occurs in the OT. ia and ua are
less likely to be read as diphthongs, but I'd mark all for consistency -
it's a departure from the usual consonant plus vowel alternation worth
noting in any case.)
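
To make the "mark all for consistency" rule concrete, here is a minimal sketch that puts a diaeresis on any plain vowel directly following another vowel. The function name mark_hiatus is hypothetical, and the rule assumes every adjacent vowel pair is a hiatus, which is exactly the point that still needs confirming.

```python
import unicodedata

VOWELS = set("aeiou")
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

def base_letter(ch):
    """Return the lowercase base letter of ch with any accents removed."""
    return unicodedata.normalize("NFD", ch)[0].lower()

def mark_hiatus(word):
    """Put a diaeresis on a plain vowel that directly follows another vowel."""
    out = []
    prev_is_vowel = False
    for ch in unicodedata.normalize("NFC", word):
        is_vowel = base_letter(ch) in VOWELS
        if prev_is_vowel and ch in VOWELS:
            # Only unaccented vowels are marked; accented ones are left alone.
            ch = unicodedata.normalize("NFC", ch + DIAERESIS)
        out.append(ch)
        prev_is_vowel = is_vowel
    return "".join(out)

print(mark_hiatus("hammaim"))
print(mark_hiatus("noah"))
```

The sketch works on plain letters only; underlined consonants such as _h_ would need to be handled by whatever display form we choose.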

This would be worth checking with a Hebrew scholar who is confident
enough to tell us definitively whether there are any other possibilities.

> I'm also uncertain about how to tell people to type in a word.
> I don't want them to bother with the accents on vowels, cos they are so
> hard to type,
> but I do want them to mark the presence of letters used as vowels.
> (I'm not sure why you think this is unimportant - how could they look up
> an entry with some letters missing?)

The problem here was that I was under the misapprehension that the
dictionaries we have (in Hebrew ordering) disregarded yod and waw when
used as a vowel. This isn't the case (for NIDOTTE and
Brown-Driver-Briggs at least anyway) so we do need to have the vowels in
place to do the lexicon lookup. Apologies for getting this wrong before.
(I still wouldn't display them if we didn't need them for this purpose
however.)

If we're going to ask the user to type them, capitals for entry seems
sensible. For display we would want something that cuts and pastes
correctly, and the answer isn't obvious.

Underlining using <u> is unlikely to survive cut-and-paste reliably.

A functional but ugly option is to display them in capitals too.

However, looking at this from the user's perspective, it's likely to
seem strange to them that while we don't care about accents on most
vowels, we do care about the underlined ones. Could we simplify things
for them by inserting the extra waws and yods behind the scenes? This
would only require a lookup table which knows the words in the OT
transliteration and can map them back into the requisite Hebrew form.
We may get a few homophones introduced this way, but not too many I
think. (Many will be words which appear both in holem and waw-holem
forms anyway.)
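
A minimal sketch of such a lookup table, under stated assumptions: the stored forms use capital vowels to stand for waw and yod used as vowels, the sample words are made up for illustration rather than taken from our data, and the names search_key, build_index and lookup are hypothetical. In practice the stored values would be the Hebrew forms.

```python
import unicodedata
from collections import defaultdict

def search_key(form):
    """Strip accents and fold capital vowels, leaving what a user would type."""
    decomposed = unicodedata.normalize("NFD", form)
    plain = (c for c in decomposed if not unicodedata.combining(c))
    # Only vowels are folded; capital consonants stay distinct.
    return "".join(c.lower() if c in "AEIOU" else c for c in plain)

def build_index(text_forms):
    """Map each search key to every text form that reduces to it."""
    index = defaultdict(set)
    for form in text_forms:
        index[search_key(form)].add(form)
    return index

# Made-up sample forms; a real table would be built from the tagged OT text.
SAMPLE_FORMS = ["shalOm", "shalom", "qOl"]
index = build_index(SAMPLE_FORMS)

def lookup(query):
    """Return all stored forms matching what the user typed, sorted."""
    return sorted(index.get(search_key(query), ()))
```

A key that maps to more than one form is one of the homophones mentioned above, so the search would show all of them rather than guess.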

I suspect we're going to need most of this mechanism anyway for handling
inflected forms - each word in the text will be tagged with its root form.

Colin
