Friday, 20 November 2009

Transliteration - using small caps

Colin, I think the problems are being beaten one by one.

You point out that Unicode has glyphs for dots under characters,
but you also point out that not all fonts contain those glyphs.
This is a big problem.
The ones we need in particular are h. (U+1E25), s. (U+1E63) and t. (U+1E6D).
Cardo (an academic font) has these, but Times New Roman and Arial don't.
I think the only standard Windows font containing them is Tahoma,
so if you try to use them, Windows will (hopefully) change the font to Tahoma,
but only for those letters, so they look odd in the word.
I don't expect that many PDAs and phones will have a full-set Unicode font.
We could supply one, of course - eg Cardo.
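For reference, a minimal Python sketch (not part of any STEP code; the `DOTTED` table and `describe` helper are names made up for illustration) that prints the three code points mentioned above with their official Unicode names, which is a quick way to confirm we are talking about the right characters before testing fonts:

```python
import unicodedata

# The three dotted letters discussed above, keyed by their plain base letter.
# The dictionary name and layout are illustrative assumptions.
DOTTED = {"h": "\u1e25", "s": "\u1e63", "t": "\u1e6d"}

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))

for base, ch in DOTTED.items():
    print(base, "->", ch, describe(ch))
```

Whether a given font actually draws these characters still has to be checked on the device itself; the script only confirms the code points.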

I agree about p/ph (a typo). I suppose we should have k/kh for completeness.

You have two suggestions for shewa: 
- @ with  e for segol
- e  with e-with-overbar for segol
I don't like the idea of introducing @ which will look strange, esp as there will be so many of them,
but I don't like introducing non-ASCII symbols either.
Can't we have è é or ë for segol?
If not, I guess we can live with a raised e, (esp as people don't have to type it in)
and then have e for segol.

I like the idea of o for both holem-vav and holem, if we use something else to indicate the presence of vav.
That leaves ô for qames hatuph

I'm turning against the idea of s for tsade. It really doesn't sound like an s. I'd much prefer tz
If so, do you think we could get away with s for sin? The problem I'm thinking of is words starting with sin,
which look rather strange with ss - eg 'hate' = ssânê
Ugh, I guess it will have to do.

I'm not sure about the dieresis for non-diphthongs. But I noticed that one system uses "y",
eg hamayim
As you point out, the problem only occurs with "aim" as in Amora'im (and "ain" as in the name Ca'in)
because when it is a furtive patach the combination isn't a normal diphthong pair (eg No'ah)

Your point about copy and paste is really important. As you say, <u> won't copy and paste.

I had a possible brainwave - what about small caps?
They look OK when they are written,
and are preserved with copy-and-paste.
You can't always easily distinguish a small-cap vowel from the lower case letter
(esp with o and u) but this doesn't matter much,
because we aren't making distinctions about pronunciation.
But small cap T and H look different from t and h.
Also, we can use ")" and "(" as reduced-size superscript for aleph and ayin.
These too will copy and paste in the same way someone might type them.
Superscript "e" for shewa works OK too.

See an example at
www.tyndalearchive.com/STEP/TEST/testTrans.htm
- copy and paste the two lines into the text box and see what happens

A few things which came up while I was doing the transliteration:
- Sometimes it is best to omit the "e" for shewa - such as yihyeh near the end of Gen.1.29
- the transliteration looks better with two spaces between words, esp when ending and starting with aleph or ayin

What do you think?
It doesn't look too complicated to read out.
So long as we tell people not to bother with accents, it is fairly easy to write out.
It works with copy+paste
The only 'missing dots' under letters are for chet and tet, which are small cap H and T
Does it look sufficiently like/dislike other systems so as to cause little confusion?

David IB

At 12:01 20/11/2009, Tyndale STEP Project wrote:
> Colin, I think you may have solved all the issues - great!

I hope so, although I've now discovered a further thing we need to
consider, which is whether everything can be cut-and-pasted correctly
and work. (Cut-and-paste into other documents, and into our search box
from our text.)

> And I hang my head in shame about the composite shewa.
> If you ever want to put me in my place, you can remind me about that.

... and you can reciprocate when you see what I have to admit below
about dictionary ordering!

> The big problem I had was the sin/shin one, and your ss/sh solution is
> perfect.
> My main problem with the 'standard' systems is that they simply can't be
> typed.
>
> All word processors can underline but have you tried putting a dot under
> narrow and wide letters in Word?
> (I have a macro for it, but it isn't perfect!).
> Displaying dots under letters on our webpages won't be easy either.
> There aren't standard glyphs available.

I think we may have been talking at cross-purposes a bit here, because I
now realise there are two different ways to underline things.

What I was assuming was that we would be using an accented letter for
these symbols. This is officially what you're meant to do.
Unfortunately, there aren't the symbols we need in the basic Latin-1
codepages, so we'd have to go to full unicode, specifically tables like
http://www.unicode.org/charts/PDF/U1E00.pdf
If we do this, it doesn't matter whether we use dot or underline, since
they're equivalent. (Well, assuming they're all there, of course.)

What I think you're suggesting is using underlining text instead of the
underbar, which would be <u>h</u> in HTML.

This is simpler to type, but has a couple of problems: firstly if you
have two successive letters underlined, there will be a continuous bar
under both, rather than two separate bars, which will look odd. More
importantly though we need to check whether it will be preserved if the
text is cut and pasted into something else.

It would also not be too difficult to construct Word (etc) macros to
automatically generate the unicode symbols from the right input. This is
trivial using the Autocorrect options in OpenOffice - I imagine a
similar feature exists in Word.

However, the problem with unicode is that if we want to distinguish the
"waw-yod" vowels as well, then the combined forms we'd like don't exist,
unless we make them out of combined characters (which are not
well-enough supported to be safe in my opinion - can explain further if
needed).
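To illustrate the point about combined characters, here is a small Python sketch using the standard `unicodedata` module. It assumes the "underline" is encoded as U+0331 (combining macron below) and the dot as U+0323 (combining dot below); those encoding choices are assumptions for illustration, not something settled in this thread. A letter such as h plus dot collapses to one precomposed code point, whereas a circumflexed vowel plus a line below stays as a base character followed by a separate combining mark:

```python
import unicodedata

DOT_BELOW = "\u0323"     # combining dot below
MACRON_BELOW = "\u0331"  # combining macron below (assumed stand-in for underline)

def compose(text):
    """Apply NFC, which merges base + marks wherever a precomposed form exists."""
    return unicodedata.normalize("NFC", text)

h_dot = compose("h" + DOT_BELOW)
o_circ_line = compose("o\u0302" + MACRON_BELOW)

# h + dot below becomes a single code point (U+1E25).
print([hex(ord(c)) for c in h_dot])
# o + circumflex + line below cannot be fully merged, so more than one
# code point remains and rendering depends on combining-mark support.
print([hex(ord(c)) for c in o_circ_line])
```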

> BTW I didn't mean to imply dropping the distinction between aleph and ayin.
> The /'_ir_/ example was a typo.
>
> A lot of systems allow the use of "(" or ")" instead of curly single
> quotes.
> I'm not sure we can allow the dropping of these, or the amalgamation into '
> because it will make lexical lookups very difficult.

Agreed - as I said I didn't think you were proposing doing this, but
wanted to point out the problems just in case you were.

> So, the alphabet would be:
>
> )-` b-v g d h w z _h_ _t_ y-î k l m n s (-´ p
> _s_ q r ss/sh t/th
>
> The display alphabet could use curly single quotes and we could allow
> users to type ) and ( instead.

(hopefully ' and ` as well, since these are also standard. We could
transform one into the other since we'll need to do a first pass anyway
to strip accents off vowels and convert the underline/underdot form into
capitals in the event that someone cuts and pastes a word from our text
into the search box).
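A rough Python sketch of such a first pass is below. Everything in it is an assumption for illustration: the function name `normalise_query`, the alias table (which simply follows the pairing in the alphabet line quoted above as printed), and the choice of U+0323 and U+0331 as the "dot" and "underline" marks. It only handles text where the dot or bar survives as a Unicode combining mark; underlining done with <u> is lost before the text ever reaches this step.

```python
import unicodedata

# Assumed aliases for the aleph/ayin signs, taken from the quoted alphabet
# line as printed; the final choice of accepted aliases is still open.
QUOTE_ALIASES = {"`": ")", "\u00b4": "("}

# Marks below a letter: combining dot below and combining macron below.
MARKS_BELOW = {"\u0323", "\u0331"}

def normalise_query(text):
    """Reduce pasted display text to the plain form a user might type."""
    text = "".join(QUOTE_ALIASES.get(c, c) for c in text)
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # A mark below turns the preceding base letter into a capital;
            # every combining mark (including vowel accents) is then dropped.
            if ch in MARKS_BELOW and out:
                out[-1] = out[-1].upper()
            continue
        out.append(ch)
    return "".join(out)

print(normalise_query("ne\u1e25\u0101m\u00e2"))
```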

I still want to put a vote in for k-kh and p-ph, as (regardless of what
modern Hebrew does) this is the way that all the textbooks tell you to
pronounce them, and I've only ever heard people talk about "nephesh",
not "nepesh", etc.

> But I'm not sure about the vowels, especially e.
> I'd like to keep simple "e" for shewa, so the vowels would be:
>
> a â e ? ê êy i î o ô u û
>
> I take your point about ë for tsere - it causes unnecessary confusion
> though I'm not sure what to do about segol.
> I'd like to use "e" for shewa for simplicity of transliteration, so we'd
> need a different accent for segol.
> Suggestions?
> Also, any suggestions for qames hatuph?

Shewa is a difficult one, and I don't think I've seen something I'd
regard as a good ASCII solution. The system I've seen which uses e for
it also uses e for segol (which I think is a mistake, since they ought
to be distinguished).

Two approaches come to mind, taking into account both issues:

1. a,e,i,o,u are used for the grammatically short vowels (including
segol and qames hatuph)

â... are used for the grammatically long vowels, including both holem
and waw-holem, which are essentially the same (and interchangeable in
many cases) and would be distinguished by the underlining.

Shewa gets a different symbol, as it isn't a vowel at all. My suggested
choice would be @, which looks a bit like a vowel, and is not _that_
dissimilar from the upside-down e that is used for the sound in IPA.

(We could display as superscripted e as is common but this is risky for
the same cut-and-paste reasons as noted before.)

2. If we're set on e for schwa, then e with overbar is the natural
choice for segol as a midlength vowel. Having broken things a bit in one
respect, we'd probably then also go for o for qames hatuph, o with
overbar for holam, o with circumflex for waw-holam.

1 feels more logical to me, but 2 would work.

One unrelated issue:
1. We've not yet accounted for tsere + yod, which I assume gets ey -
possibly plus overbar - rather than êy.

> I didn't know about diaeresis for non-diphthong (eg hammaïm).
> Do you know if this works for every case? - ie does this problem only
> happen with "i"?

I believe that the only ways you can get two vowels in a row without an
intervening consonant are the -aim dual ending and the furtive patah.
The latter would be worth marking as ä if we can, as words like noa_h_
(Noah) could otherwise be read with a diphthong. (ea is similarly
confusing, but I don't know whether it occurs in the OT. ia and ua are
less likely to be read as diphthongs, but I'd mark all for consistency -
it's a departure from the usual consonant plus vowel alternation worth
noting in any case.)

This would be worth checking with a Hebrew scholar who is confident
enough to tell us definitively whether there are any other possibilities.

> I'm also uncertain about how to tell people to type in a word.
> I don't want them to bother with the accents on vowels, cos they are so
> hard to type,
> but I do want them to mark the presence of letters used as vowels.
> (I'm not sure why you think this is unimportant - how could they look up
> an entry with some letters missing?)

The problem here was that I was under the misapprehension that the
dictionaries we have (in Hebrew ordering) disregarded yod and waw when
used as a vowel. This isn't the case (for NIDOTTE and
Brown-Driver-Briggs at least anyway) so we do need to have the vowels in
place to do the lexicon lookup. Apologies for getting this wrong before.
(I still wouldn't display them if we didn't need them for this purpose
however.)

If we're going to ask the user to type them, capitals for entry seems
sensible. For display we would want something that cuts and pastes
correctly, and the answer isn't obvious.

Underlining using <u> is not likely to be preserved under cut-and-paste
safely.

Functional but ugly is to display them in capitals too.

However, looking at this from the user's perspective, it's likely to
seem strange to them that while we don't care about accents on most
vowels, we do care about the underlined ones. Could we simplify things
for them by inserting the extra waws and yods behind the scenes? This
would only require a lookup table which knows the words in the OT
transliteration and can map them back into the requisite Hebrew form.
We may get a few homophones introduced this way, but not too many I
think. (Many will be words which appear both in holam and waw-holam
forms anyway.)

I suspect we're going to need most of this mechanism anyway for handling
inflected forms - each word in the text will be tagged with its root form.
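The lookup-table idea could be sketched in Python roughly as follows. The helper names (`strip_marks`, `build_index`) and the three sample entries are purely illustrative assumptions, not real project data: the point is only that the key is the transliteration with accents stripped, and that each key maps to a list so that holam and waw-holam spellings sharing one typed form are both returned.

```python
import unicodedata
from collections import defaultdict

def strip_marks(word):
    """Drop all combining marks, so accented vowels become plain letters."""
    nfd = unicodedata.normalize("NFD", word)
    return "".join(c for c in nfd if not unicodedata.combining(c))

def build_index(pairs):
    """Map each accent-free transliteration to every Hebrew form that yields it."""
    index = defaultdict(list)
    for translit, hebrew in pairs:
        key = strip_marks(translit).lower()
        if hebrew not in index[key]:
            index[key].append(hebrew)
    return index

# Illustrative sample only: (display transliteration, consonantal Hebrew).
SAMPLE = [
    ("sh\u00e2l\u00f4m", "\u05e9\u05dc\u05d5\u05dd"),
    ("q\u00f4l", "\u05e7\u05d5\u05dc"),   # written with waw
    ("q\u014dl", "\u05e7\u05dc"),         # written without waw
]

index = build_index(SAMPLE)
print(index.get("qol"))  # both spellings come back for the one typed form
```

In a real table the values would presumably also carry the root-form tag mentioned above, so that the same structure serves inflected forms.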

Colin

--
Posted By Tyndale STEP Project to Tyndale STEP - Programming on 11/20/2009 04:01:00 AM
