Friday, 4 December 2009

Dictionary samples

I've now got a proof-of-concept dictionary and Bible which I've put in
the dropbox under STEPHist/DictionarySamples.

It comprises a copy of the KJV and Easton, both marked up automatically,
and shows what can be done with minimal human input. Fire up any of the
HTML files and have a play.

The source files from which it was generated are also included, as
"dict.txt" and "bibleLink.csv". The former is the dictionary database;
the latter is a list of the articles expected to be referenced from each
verse in the Bible. (As previously discussed, it would be better to use
ranges than individual verses, but I haven't written the code to do this.)
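
As a rough illustration of the range idea, the sketch below collapses the verse numbers at which an article is referenced (within one chapter) into inclusive ranges. It is written in Python rather than the Perl of the actual scripts, and the function name and the plain list of integers are assumptions for illustration only; the real layout of "bibleLink.csv" may well differ.

```python
def collapse_to_ranges(verses):
    """Collapse verse numbers into inclusive (start, end) ranges."""
    ranges = []
    for v in sorted(set(verses)):
        if ranges and v == ranges[-1][1] + 1:
            # Consecutive verse: extend the most recent range.
            ranges[-1] = (ranges[-1][0], v)
        else:
            # Gap (or first verse): start a new range.
            ranges.append((v, v))
    return ranges
```

A run of verses such as 1, 2, 3 would then be stored as the single pair (1, 3), while an isolated verse 7 stays as (7, 7).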

To be more precise in what has been done:

In pre-processing:
* The source dictionary gets parsed to identify Bible references and
these are marked up using the [[ | ]] notation. (I also updated Easton
at this point to convert his book abbreviations to the OSIS ones,
but with a space inserted in cases like "1 Chr".)
* Both the dictionary and the KJV are then searched for the single-word
headwords in Easton. The dictionary is again marked up using the Wiki
notation, but for the Bible we just maintain a list of which articles occur.

At run-time:
* The dictionary is just converted to HTML, turning [[]] into HTML links.
* The Bible is searched verse by verse for the words we expect to be
there, and those found have a link inserted.
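
A minimal sketch of the [[ ]] to HTML step, in Python rather than the Perl actually used. It assumes the MediaWiki-style order [[target|label]], with a bare [[target]] using the target as its own label; the `href_for` file-naming rule is purely hypothetical.

```python
import html
import re

# Matches [[target]] or [[target|label]].
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

def wiki_to_html(text, href_for=lambda t: t.replace(" ", "_") + ".html"):
    """Replace wiki-style links in text with HTML anchors."""
    def repl(m):
        target = m.group(1).strip()
        label = (m.group(2) or target).strip()
        return '<a href="{}">{}</a>'.format(
            html.escape(href_for(target), quote=True), html.escape(label))
    return LINK.sub(repl, text)
```

Note that only the link targets and labels are escaped here; the surrounding article text is passed through unchanged.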


To add/fix:
* Each article will get a "main reference" (which will default to the
first reference found in the article) and a "timeline reference" (the
point on the timeline referred to by the first reference).
* There's a bug at the minute in that part of a word can be matched.
* The dictionary needs to be converted to use ESV spellings not KJV
ones. I know roughly how to do this mostly automatically.
* Add some automatic rules to map plural proper nouns where the headword
is a singular, or vice versa.
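
For the partial-word bug and the plural rule, something along these lines might do. It is a Python sketch, not the existing Perl; the function names are hypothetical, and the "add or strip a trailing s" rule is a deliberately naive assumption that would need checking against real headwords.

```python
import re

def plural_variants(headword):
    # Naive assumption: singular and plural differ only by a trailing "s".
    if headword.endswith("s"):
        return [headword, headword[:-1]]
    return [headword + "s", headword]

def link_headword(text, headword):
    """Link whole-word occurrences of headword (or its naive plural)."""
    alternation = "|".join(re.escape(v) for v in plural_variants(headword))
    # \b on both sides stops a match inside a longer word.
    pattern = re.compile(r"\b(?:" + alternation + r")\b")

    def repl(m):
        word = m.group(0)
        if word == headword:
            return "[[" + headword + "]]"
        return "[[" + headword + "|" + word + "]]"

    return pattern.sub(repl, text)
```

With this, the headword "Dan" no longer matches inside "Daniel", and "Philistines" links to a "Philistine" article.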

Observations and notes:
* Cases where the headword is more than one word can be handled, but may
need some manual checking to confirm what phrase in the Bible or
dictionary wants to link to it.
* There are a lot of "false positives" at the minute. I told the script
to ignore the word "A" and any link from an article to itself (so the
word "David" in the "David" article is unlinked). There are several other
words we want to omit (So, On, No, By, each of which is a proper name but
also a common word in English). Other common words like "city, east,
queen, sun" may not want linking every time they appear - it depends on
how the links are displayed in the final interface.
* Finding the Bible references is not 100% reliable (unsurprisingly),
mainly due to the fact that when you see a reference to 6:5 you need to
have understood the text preceding to know which book it's talking
about. I've assumed it's always the same book as the previous reference,
which is usually the case. However, some will need fixing during
checking. (It's not terribly fast either.)
* The dictionary should be marked up during development (no real
advantage to doing so at run-time) so we can fix things like this and
also improve what's linked to - if we're going to edit the article
anyway, worth putting this into the mix.
* We're going to have to mark "unknown" Bibles up completely
automatically anyway. We could consider doing the "main" translations in
development, although I think that doing them completely automatically
(either during pre-processing or at run-time) should be fine, at least
in version 1.
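
The "same book as the previous reference" assumption can be sketched as below. It is Python rather than the Perl of the real scripts, and the regular expression, the `known_books` set and the tuple output are all assumptions for illustration; it handles only simple chapter:verse references.

```python
import re

# Optional book name (e.g. "Gen", "1 Chr", with or without a full stop),
# followed by chapter:verse.
REF = re.compile(r"(?:\b((?:[1-3] )?[A-Z][a-z]+)\.? )?\b(\d+):(\d+)\b")

def resolve_references(text, known_books):
    """Return (book, chapter, verse) tuples, carrying the book forward."""
    current_book = None
    resolved = []
    for m in REF.finditer(text):
        book, chapter, verse = m.group(1), int(m.group(2)), int(m.group(3))
        if book in known_books:
            current_book = book
        # A bare "7:1" inherits the most recently seen book.
        if current_book is not None:
            resolved.append((current_book, chapter, verse))
    return resolved
```

A capitalised word that is not a known book (such as "See" before "7:1") is simply ignored, so the previous book still applies. References that actually belong to a different book than the preceding one would still come out wrong and need fixing by hand.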


I'm going to tidy up and add the processing scripts (which are in Perl)
as well soon (probably on Monday), which will give Chris a heads-up on
how to parse the input, not that it's too complicated.

Just a warning that I'm moving (just a mile down the road) on the 12th,
so will be a little quiet next week and the week after - I may well have
no internet access for a few days after the move, depending on how fast
I can get broadband installed.

Colin
