Author Topic: Browsing Wikipedia offline on OS 9

Offline OS923

  • Platinum Member (500+ Posts)
  • *****
  • Posts: 504
Browsing Wikipedia offline on OS 9
« on: February 07, 2019, 05:05:02 AM »
WikiTaxi stopped working on my Windows computer (out of memory). Then I switched to BzReader. This has the advantage that you can select and copy the text, style and links. Unfortunately it doesn't understand templates, just like WikiTaxi.

I did some calculations for having Wikipedia offline on OS 9. It looks like a realistic plan. The uncompressed Wikipedia XML data dump is now around 66 GB. The longest page title is 266 characters. A plain text index is around 750 MB. I could split this index into 676 files (26 × 26 two-letter prefixes) like aa.idx, ab.idx and so on. If I want page "aax" then I search for "aax" in aa.idx, which requires reading around 1.1 MB on average (750 MB / 676). There I find the offset of the "<page>" tag in the XML. Then I need to read a few lines from that offset and convert the Wikitext to HTML. It should be possible to do this in a fraction of a second and with less than 5 MB of memory.
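
A minimal sketch of that lookup, written in Python for brevity rather than as something that would run on OS 9. The index line format (title, tab, byte offset), the function names and the "other.idx" bucket for titles that don't start with two letters are all assumptions for illustration, not the format of the actual program:

```python
import os

def bucket_name(title):
    """Map a page title to its index file, e.g. "Aax" -> "aa.idx"."""
    key = title.lower()[:2]
    if len(key) == 2 and all("a" <= c <= "z" for c in key):
        return key + ".idx"
    # Assumption: anything else (digits, punctuation, one-letter titles)
    # goes into a single catch-all bucket.
    return "other.idx"

def build_index(entries, index_dir):
    """Write (title, offset) pairs into per-bucket plain text files."""
    buckets = {}
    for title, offset in entries:
        buckets.setdefault(bucket_name(title), []).append((title, offset))
    for name, rows in buckets.items():
        with open(os.path.join(index_dir, name), "w", encoding="utf-8") as f:
            for title, offset in sorted(rows):
                f.write(f"{title}\t{offset}\n")

def find_offset(title, index_dir):
    """Scan one bucket file and return the byte offset of the page, or None."""
    path = os.path.join(index_dir, bucket_name(title))
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        for line in f:
            name, _, offset = line.rstrip("\n").rpartition("\t")
            if name == title:
                return int(offset)
    return None

def read_page(dump_path, offset, max_bytes=65536):
    """Seek to the "<page>" tag and return the bytes up to "</page>"."""
    with open(dump_path, "rb") as f:
        f.seek(offset)
        chunk = f.read(max_bytes)
    end = chunk.find(b"</page>")
    # If the page is longer than max_bytes, return the truncated chunk.
    return chunk if end < 0 else chunk[: end + len(b"</page>")]
```

Only one bucket file and one small slice of the dump are read per request, which is what keeps the memory use low. A linear scan of a bucket of around 1 MB is probably fast enough; a binary search over the sorted file would be the next step if it is not.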

The idea is to install a small HTML page and an AppleScript CGI in MacHTTP. The CGI communicates via AppleScript with my program.

The source code of BzReader shows how to convert the Wikitext to HTML and how to convert formulas to pictures. It doesn't seem too difficult.

Offline OS923

  • Platinum Member (500+ Posts)
  • *****
  • Posts: 504
Re: Browsing Wikipedia offline on OS 9
« Reply #1 on: February 13, 2019, 06:36:16 AM »
The indexing program works already. This was one day of work. The speed is comparable to similar programs for Windows. The solution will continue to work until the uncompressed Wikipedia data dump XML is around 1 TB.

This is just an intermediate project because my Windows solution was insufficient.

The classes for reading and writing the index will be published as a library. The translation of Wikitext and the rendering of formulas will be open source plugins.

Offline OS923

  • Platinum Member (500+ Posts)
  • *****
  • Posts: 504
Re: Browsing Wikipedia offline on OS 9
« Reply #2 on: September 16, 2019, 08:51:50 AM »
"Build index" is replaced with "Index Wikipedia". It works with multiple languages. In the example I use English, French and Dutch. I'm now working on "Offline Wikipedia".

Offline OS923

  • Platinum Member (500+ Posts)
  • *****
  • Posts: 504
Re: Browsing Wikipedia offline on OS 9
« Reply #3 on: September 20, 2019, 08:58:20 AM »
Offline Wikipedia works already. The pages use an AppleScript CGI which uses AppleScript to communicate with the Offline Wikipedia program. It's fast and Unicode was never an issue.

Unfortunately, my pages are cut off after 32K. It looks like the author of MacHTTP limited the text that an AppleScript CGI can return to 32K because he thought that strings in AppleScript are limited to 32K.

I do a simple conversion from Wikitext to HTML: I replace the special characters with spaces. I'll do the correct translation when I've solved the 32K problem.
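
Roughly what such a placeholder conversion could look like, again in Python for brevity. Which characters count as "special" and the paragraph handling are assumptions for illustration; this is a crude stand-in, not a real Wikitext parser:

```python
import html

# Assumption: the markup characters that get blanked out.
SPECIAL = "[]{}|'=*#<>"
_BLANK = str.maketrans({c: " " for c in SPECIAL})

def crude_wikitext_to_html(wikitext):
    """Replace markup characters with spaces and wrap paragraphs in <p>."""
    out = []
    for para in wikitext.translate(_BLANK).split("\n\n"):
        text = " ".join(para.split())  # collapse runs of whitespace
        if text:
            out.append("<p>" + html.escape(text) + "</p>\n")
    return "".join(out)
```

Links, headings and templates all lose their meaning this way (a template just becomes its parameter text), so it is only good enough to check that pages arrive in the browser.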

Offline OS923

  • Platinum Member (500+ Posts)
  • *****
  • Posts: 504
Re: Browsing Wikipedia offline on OS 9
« Reply #4 on: September 22, 2019, 01:19:05 AM »
I copied the source code of MacHTTP. I spent 2 hours trying to make this work in the hopes of changing the 32K limit. Nothing but crashes.

Then I switched to Apple's "Web sharing". My URLs don't look so nice, but it worked immediately and without the 32K limit.

Offline OS923

  • Platinum Member (500+ Posts)
  • *****
  • Posts: 504
Re: Browsing Wikipedia offline on OS 9
« Reply #5 on: September 23, 2019, 09:05:51 AM »
I linked MimeText 1.77 to convert formulas to pictures. It works already as a shared library. I find it OK, but with antialiasing some characters are not "closed".

Offline IIO

  • Platinum Member (500+ Posts)
  • *****
  • Posts: 2227
  • new to the forums
Re: Browsing Wikipedia offline on OS 9
« Reply #6 on: September 24, 2019, 04:08:39 PM »
seems readable from the third on (24pt?)
"It is true that the "pre-emptive multitasking" advantage present in OS X can be illustrated by downloading CD-ROM ISOs and rendering chaos theory formulas while simultaneously instant messaging and posting on FaceBook what you ate... but in reality, what did you create?"
- DieHard, random forum troll at macos9lives.com

Offline OS923

  • Platinum Member (500+ Posts)
  • *****
  • Posts: 504
Re: Browsing Wikipedia offline on OS 9
« Reply #7 on: September 25, 2019, 01:57:17 AM »
It's MimeTex, not MimeText. One feature doesn't work (calendar) because it uses too much memory. Antialiasing and transparency are optional.

Offline IIO

  • Platinum Member (500+ Posts)
  • *****
  • Posts: 2227
  • new to the forums
Re: Browsing Wikipedia offline on OS 9
« Reply #8 on: September 25, 2019, 01:47:29 PM »
or print to a bitmap?

Offline OS923

  • Platinum Member (500+ Posts)
  • *****
  • Posts: 504
Re: Browsing Wikipedia offline on OS 9
« Reply #9 on: September 26, 2019, 02:36:20 AM »
It can return a picture as a file or on stdout. I changed the stdout option so that it returns the picture in memory instead. I ask for the picture in memory and then copy it into the reply of the Apple event that was received from the AppleScript CGI.

Offline OS923

  • Platinum Member (500+ Posts)
  • *****
  • Posts: 504
Re: Browsing Wikipedia offline on OS 9
« Reply #10 on: September 26, 2019, 02:41:51 AM »
I found an interesting program here that converts Wikitext to XML:
https://dizzylogic.com/wiki-parser/

Unfortunately, it's not exactly like Wikipedia:
Quote
Wiki Parser currently omits tables in Wikipedia pages as they are almost impossible to present in textual format. It also flattens multi-level lists (but keeps every list element in its own XML node).

It's open source. This shows how to handle templates.