need - ie Bible people.
It would be good to have Bible places as well. Other subjects aren't
so important.
The most important bit is the text, not the pictures, and esp the
Hebrew & Greek which are missing from the older dictionaries.
It is sufficient to be tied up with the same headwords as the other
articles (if possible - not if it means manual work).
Does the Hebrew & Greek come off the web as Unicode? Prob it would be
best to save them as RTF or DOC
(RTF is theoretically more universal, but actually there are
different versions of RTF, so DOC is prob better, paradoxically).
Wrt pictures, we'll probably leave these off the distributed version,
because they take up disproportionate space,
and link to something like the Biblos picture search, which gathers
pictures from several sources.
Thanks
David IB
At 10:38 13/01/2010, Colin Bell wrote:
>David Instone-Brewer wrote:
>>At 10:03 11/01/2010, Colin Bell wrote:
>>>David,
>>>
>>>See STEP-Hist/Dictionary/dict for a whole pile of data files and a
>>>README as to what's there. Should be fine for you to convert to
>>>get some data to Sarah, but let me know if there are problems. Will
>>>be sorting things out and tidying the scripts up in the next few days.
>>>
>>>Colin
>>The new files have come through OK.
>>The organisation and quality of the data is fantastic! - Thanks.
>>Did you ever manage to scrape the Wiki articles? I seem to remember
>>the Mobi wiki was particularly good (for some reason). In
>>particular, I was interested in the pointed Hebrew which was included.
>
>David,
>
>I got part of the way through this. I have something which downloads
>the text from articles from Wikipedia into the "wiki" format, i.e.,
>what you'd type into the edit box. It uses the official Wikipedia
>API so is more reliable than trying to scrape.
>
>Remaining issues would be:
>1. what articles to download. I started from the "Biblical people"
>category, and things in subcategories of that, cutting off where
>things were way off topic. The results include lots of stuff we
>don't want but hopefully all the data we do.
>2. format of articles. What we have can be read for the purpose of
>(human) editing, but they're not suitable for putting into STEP
>directly. This could probably be dealt with in a few days' work.
>3. images. I couldn't get this to work with the API I tried. Either
>I need to try again and fix it, or we go for a direct scraping
>approach (not too hard).
>
>I'm having to redownload some stuff at the minute to cope with
>Unicode issues. Will ZIP up all the text files I have and put them
>in the dropbox, probably later today. (Would be fairly trivial to
>convert to the same format as the other files if you wanted them like that.)
>
>Colin
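
For illustration only, the approach Colin describes (fetching article text in "wiki" format through the official API, starting from a category and following subcategories up to a cutoff) might look roughly like the sketch below. This is not Colin's script. It assumes the MediaWiki Action API as currently documented, the English Wikipedia endpoint, and a simple depth limit standing in for "cutting off where things were way off topic". The function names, the User-Agent string and the max_depth parameter are all hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint (English Wikipedia)


def wikitext_params(title):
    """Parameters asking for the raw wikitext of a page's latest revision."""
    return {
        "action": "query", "format": "json", "formatversion": "2",
        "prop": "revisions", "rvprop": "content", "rvslots": "main",
        "titles": title,
    }


def category_params(category, cont=None):
    """Parameters listing the pages and subcategories directly in a category."""
    params = {
        "action": "query", "format": "json", "formatversion": "2",
        "list": "categorymembers", "cmtitle": category,
        "cmtype": "page|subcat", "cmlimit": "max",
    }
    if cont:
        params["cmcontinue"] = cont
    return params


def api_get(params):
    """Perform one API request (needs network access)."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent; a real script should identify itself properly.
    request = urllib.request.Request(url, headers={"User-Agent": "dict-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_wikitext(data):
    """Return the wikitext from a revisions response, or None if the page is missing."""
    page = data["query"]["pages"][0]
    if page.get("missing"):
        return None
    return page["revisions"][0]["slots"]["main"]["content"]


def list_members(category, get=api_get):
    """Return (article titles, subcategory titles) for one category."""
    pages, subcats, cont = [], [], None
    while True:
        data = get(category_params(category, cont))
        for member in data["query"]["categorymembers"]:
            # Namespace 14 is the Category namespace.
            (subcats if member["ns"] == 14 else pages).append(member["title"])
        cont = data.get("continue", {}).get("cmcontinue")
        if not cont:
            return pages, subcats


def collect_titles(root, max_depth=2, get=api_get):
    """Breadth-first walk of the category tree, stopping max_depth levels below root."""
    seen, titles, frontier = {root}, set(), [root]
    for _ in range(max_depth + 1):
        next_frontier = []
        for category in frontier:
            pages, subcats = list_members(category, get)
            titles.update(pages)
            for sub in subcats:
                if sub not in seen:  # guard against category loops
                    seen.add(sub)
                    next_frontier.append(sub)
        frontier = next_frontier
    return sorted(titles)
```

With network access, something like collect_titles("Category:Biblical people") followed by extract_wikitext(api_get(wikitext_params(title))) for each title would gather the raw text. A fixed depth limit is a crude cutoff, so the result would still need manual pruning, as the message above notes.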