Author Topic: Browsing Wikipedia offline on OS 9  (Read 593 times)

OS923
Browsing Wikipedia offline on OS 9
« on: February 07, 2019, 05:05:02 AM »
WikiTaxi stopped working on my Windows computer (out of memory), so I switched to BzReader. BzReader has the advantage that you can select and copy the text, styling, and links. Unfortunately, like WikiTaxi, it doesn't understand templates.

I did some calculations for having Wikipedia offline on OS 9, and it looks like a realistic plan. An uncompressed Wikipedia data dump XML is now around 66 GB, and the longest page title is 266 characters. A plain-text index is around 750 MB, which I could split into 676 files: aa.idx, ab.idx, and so on. If I want the page "aax", I search for "aax" in aa.idx, which means reading around 1.1 MB; there I find the offset of the "<page>" tag in the XML. Then I only need to read a few lines and convert the Wikitext to HTML. It should be possible to do this in a fraction of a second and in less than 5 MB of memory.
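A minimal sketch of that lookup step, written in portable C++ rather than OS 9 Toolbox code. The tab-separated index line format ("title, tab, byte offset") and the file names here are my assumptions, not the actual on-disk layout.

```cpp
// Lookup sketch: pick the .idx bucket from the first two letters of the title,
// scan it for "title<TAB>offset", then seek to that offset in the uncompressed
// dump and read one <page> block. The index line format is an assumption.
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

// Returns the byte offset of the <page> element for `title`, or -1 if absent.
long long FindPageOffset(const std::string& indexDir, const std::string& title) {
    std::string bucket = indexDir + "/" + title.substr(0, 2) + ".idx"; // "aax" -> aa.idx
    std::ifstream idx(bucket.c_str());
    std::string line;
    while (std::getline(idx, line)) {                 // linear scan of ~1.1 MB
        std::string::size_type tab = line.find('\t');
        if (tab != std::string::npos && line.compare(0, tab, title) == 0)
            return std::atoll(line.c_str() + tab + 1);
    }
    return -1;
}

// Reads lines from the dump starting at `offset` until </page> is seen.
std::string ReadPage(const std::string& dumpPath, long long offset) {
    std::ifstream dump(dumpPath.c_str(), std::ios::binary);
    dump.seekg(offset);
    std::string xml, line;
    while (std::getline(dump, line)) {
        xml += line;
        xml += '\n';
        if (line.find("</page>") != std::string::npos) break;
    }
    return xml;
}

int main() {
    long long off = FindPageOffset("index", "aax");   // hypothetical paths
    if (off >= 0)
        std::cout << ReadPage("enwiki.xml", off);
    return 0;
}
```

With a sorted bucket file a binary search would cut the read well below 1.1 MB, but even the plain scan above stays within the memory budget described.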

The idea is to install a small HTML page and an AppleScript CGI in MacHTTP that communicates with my program via AppleScript.

How the Wikitext has to be converted to HTML and how formulas can be converted to pictures can be found in the source code of BzReader. It doesn't seem too difficult.
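To give a flavour of that translation step, here is a toy fragment that handles only '''bold''' and [[internal links]]; real Wikitext (templates, tables, references, math) is far more involved, which is exactly what the BzReader source covers. The "/wiki?title=" link target is an invented placeholder.

```cpp
// Toy Wikitext-to-HTML fragment: handles only '''bold''' and [[Page|label]]
// links, to show the shape of the translation. Not a real parser.
#include <iostream>
#include <string>

std::string WikitextToHtml(const std::string& src) {
    std::string out;
    bool bold = false;
    for (std::string::size_type i = 0; i < src.size();) {
        if (src.compare(i, 3, "'''") == 0) {             // bold toggle
            out += bold ? "</b>" : "<b>";
            bold = !bold;
            i += 3;
        } else if (src.compare(i, 2, "[[") == 0) {       // internal link
            std::string::size_type end = src.find("]]", i + 2);
            if (end == std::string::npos) { out += src[i++]; continue; }
            std::string inner = src.substr(i + 2, end - i - 2);
            std::string::size_type bar = inner.find('|');
            std::string target = (bar == std::string::npos) ? inner : inner.substr(0, bar);
            std::string label  = (bar == std::string::npos) ? inner : inner.substr(bar + 1);
            out += "<a href=\"/wiki?title=" + target + "\">" + label + "</a>"; // placeholder URL
            i = end + 2;
        } else {
            out += src[i++];
        }
    }
    if (bold) out += "</b>";
    return out;
}

int main() {
    std::cout << WikitextToHtml("'''OS 9''' runs [[MacHTTP|a web server]].") << std::endl;
    return 0;
}
```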

OS923
Re: Browsing Wikipedia offline on OS 9
« Reply #1 on: February 13, 2019, 06:36:16 AM »
The indexing program already works; it was one day of work. The speed is comparable to similar programs for Windows. The solution should keep working until the uncompressed Wikipedia data dump XML reaches around 1 TB.
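A sketch of what such an indexer has to do: stream the dump once, remember the byte offset of each "<page>" line, grab the following "<title>", and append "title, tab, offset" to the matching two-letter bucket file. The bucket rule and index format here are my guesses, not the actual implementation.

```cpp
// Indexer sketch: one streaming pass over the uncompressed dump, appending
// "title<TAB>byte offset of <page>" lines to two-letter bucket files
// (aa.idx, ab.idx, ...). Format and bucket rule are assumptions; very short
// or non-alphabetic titles would need their own bucket rule.
#include <cctype>
#include <fstream>
#include <string>

void BuildIndex(const std::string& dumpPath, const std::string& indexDir) {
    std::ifstream dump(dumpPath.c_str(), std::ios::binary);
    std::string line;
    long long pageOffset = -1;
    while (true) {
        long long lineStart = dump.tellg();          // offset of the line about to be read
        if (!std::getline(dump, line)) break;
        if (line.find("<page>") != std::string::npos) {
            pageOffset = lineStart;                  // <title> follows shortly after <page>
        } else if (pageOffset >= 0 && line.find("<title>") != std::string::npos) {
            std::string::size_type a = line.find("<title>") + 7;
            std::string::size_type b = line.find("</title>", a);
            std::string title = line.substr(a, b - a);
            std::string key = title.substr(0, 2);    // bucket key, e.g. "Aa" -> "aa"
            for (std::string::size_type i = 0; i < key.size(); ++i)
                key[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(key[i])));
            std::ofstream out((indexDir + "/" + key + ".idx").c_str(), std::ios::app);
            out << title << '\t' << pageOffset << '\n';
            pageOffset = -1;                          // wait for the next <page>
        }
    }
}
```

Reopening the bucket file for every title keeps the sketch short; a real indexer would keep the 676 files open or buffer writes per bucket to get the speed described above.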

This is just an intermediate project because my Windows solution was insufficient.

The classes for reading and writing the index will be published as a library. The translation of Wikitext and the rendering of formulas will be open-source plugins.