Recovering Digital Magazine Content

Published Friday, February 23, 2024 by Bryan

A software post?! I thought this blog was all motorcycling and woodworking now!

Well, it started with woodworking.

Two DVDs clipped into left and right sides of a folding case.
American Woodworker, 25 Years of Issues: March 1985 through March 2010. A two-disc set.

I'm working on reorganizing my shop. While putting away the magazines that I had been flipping through for ideas, I came across a set of DVDs my father-in-law gave me. They contain digital editions of the first 25 years of the magazine American Woodworker. They're essentially unusable as-is now, because they're delivered as Adobe Flash apps. I don't have anything that runs Flash anymore.

A web browser's URL bar listing path 'file:///Volumes/American Woodworker - Disc1/disc1/issues/1985/1985_01/index.html', with content below it reading 'Alternate HTML content should be placed here. This content requires the Adobe Flash Player. Get Flash'
The page that greeted me when I tried to view one of these DVD magazine issues.

But that's silly. I'm a computer professional. It's just bytes. The articles are on there somewhere. And so began a game. Could I figure out how the Flash apps were loading and displaying the data for these issues, and recover it for my own use?

Spoiler: I recovered it all. And if you have a set of these DVDs as well, I have published the scripts I used to do it on GitHub. Continue reading here if you're interested in the software version of those wonderful rusty-old-machine-restoration YouTube videos.

I needed to figure out two major things to complete this project:

  1. What format was the magazine data currently in?
  2. What format did I want it to be in?

I started with the first, because whether I could access the data at all would determine whether I needed to spend any time on the second.

Extract and Import

Images

The URL in the address bar above the Flash-required message gave a small amount of hope. A hierarchy of directory-per-year containing directory-per-issue meant we probably weren't dealing with one giant database. Looking at the other files and directories in that path wasn't promising at first.

View of Mac Finder window, with contents described in the next paragraph.
Directory listing of files at the same path as index.html

The Images/ directory was UI icons, not magazine content. Similarly, txt/ held translations of UI text into other languages, not the articles. I was worried I'd have to un-minify one of the Scripts/ files or decompile an exe. But then I looked in Vol_1_No_1_March_1985/. That contained a directory likely named for the hash of some data, and in there...

View of Mac Finder window, with contents described in the next paragraph.
Directory listing of files in the interesting subdirectory.

Bingo! The format of the issue data could have been anything. It turned out to be the most obvious thing: JPEG scans of each page. Well, almost obvious JPEG scans. In each issue's data directory are two images of each page. One is Thumbnail_N.jpg, a 175x240 pixel preview image. The other is Page_N.jpg, a larger 750x1024 pixel version. The text is a little grainy on the second, but basically readable.

Dovetail joints are an essential and important part of fine cabinetmaking. In figure 1 a method for determining the angle of dovetails is pictured. This angle of about ten degrees is not an arbitrary angle, and slight variations will be found on furniture pieces, as well as on antiques you may see. The sliding T-bevel can be set to this angle to draw the dovetail angles; this step is being done in figure 3.
Sample text from a 750x1024 pixel page image. Text from the article 'How To Lay Out and Make Dovetail Joints' by Franklin H. Gottshall, as printed on page 32 of American Woodworker, Volume 1 Number 1, March 1985.
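
If you have these discs and want to double-check what's on yours, macOS's sips can report an image's dimensions without opening anything (the filename here is just an example from the first issue):

# Report the pixel dimensions of a page scan (sips ships with macOS)
% sips -g pixelWidth -g pixelHeight Page_32.jpg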

Right next to each JPEG there is also a Zoom_Page_N.swf. That's the Flash content the web pages warned me about. I ignored these files for a while, assuming they were useless to me - unreadable and likely just viewer-app wrappers around the JPEGs I could already see. I was curious why they were roughly double the size of their JPEG counterparts, though.

Monospace terminal output of the swfdump command listing things like file size, frame count, etc. One line shows '[015] 398799 DEFINEBITSJPEG2 defines id 0003'.
The contents of that page's matching Zoom_Page.swf file.

A quick bit of searching led me to the swftools suite. The swfdump tool showed me a list of the contents of one of the Zoom_Page files, and I learned that each held a few small (tens of bytes) “shape” and “object” definitions ... and one large JPEG. Literally almost the entire file was another plain JPEG image. The swfextract tool pulled one out for me, and I found I had a third scan of the page - this one at a whopping 1498x2054 pixels. No grainy text here - these are easy to read.

The same sample text, but from the 1498x2054 pixel image.
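
If you want to repeat this on your own discs, the two commands look roughly like this (the filename and image ID are examples; the ID to pass to swfextract is the one shown in the DEFINEBITSJPEG2 line of swfdump's output for that particular file):

# List the contents of one page's SWF wrapper
% swfdump Zoom_Page_32.swf

# Pull out the embedded JPEG by the id shown in the DEFINEBITSJPEG2 line
% swfextract -j 3 Zoom_Page_32.swf -o Page_32_full.jpg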

I've decided to cut the DVD creators some slack for using this awful tech. The file creation dates say 2010. I know what hardware I was using in 2010, and I know what hardware I had to develop for in 2010. Retina displays didn't exist yet. A typical 15-inch laptop was 1280 pixels wide, and anything smaller was likely 1024. Showing someone a 1498x2054 pixel image without obvious ways to zoom and pan (as some of that era's browsers might have done) would have been pointless.

But in 2024, there are few screens without that many pixels. My five-year-old iPad has just a few more in each direction. I'll keep these full-resolution images and ignore the rest.

Text

One of the things I'd like to be able to do with these back issues is to search for project descriptions. The reason I had magazines to put away is that I had been scouring them for ideas for my shop reorganization. It would have been great to search for “scrapwood storage”, “outfeed table”, “tool wall”, etc.

Luckily, the DVD system wanted to offer this feature too. In each issue's data directory is a file called DocumentText.xml. It's obviously the output of an OCR (Optical Character Recognition) program run over each page. I say “obviously” because it's got all the sorts of errors I've seen in OCR output before - incorrect characters, strange spaces in the middle of words, etc. But it's there, and it's easy to read.

The text is nearly the same as seen in the images, with the exception of the large letter D that begins the paragraph being on its own line (with the next line beginning 'ovetail'), and a capital I beginning the word 'important' instead of a lowercase i.
OCR text of the paragraph in the earlier examples.
In a serif font, 'How To Lay Out and Make Dovetail Joints'. Below it in an italic font, 'By Franklin H. Gottshall'. Inset next to them, in a monospace font, 'Ho w T o La y Ou t an d Mak e Dovetai l Joint s B y Frankli n H . Gottshal l'
The title and attribution lines for the article, with OCR translation inset.

And maybe I'm too harsh on the quality of the OCR. I also tried to re-OCR a few images using the built-in macOS tool called sips. It produced similar-quality output - the same sorts of errors, just in different places.
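
Even without any fancier packaging, that file makes the simplest version of the search I wanted possible with stock tools. Something like this (the mount point is from my disc; the layout on yours may differ):

# Find every issue's OCR text and list the ones mentioning a phrase
% find "/Volumes/American Woodworker - Disc1" -name DocumentText.xml -exec grep -li "outfeed table" {} +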

Metadata

Surprisingly, the content of these magazines (images and text from the pages) turned out to not be that complicated to extract. What about the metadata? Title? Publication date? Table of contents? Here is where the fun began - where human and machine collided.

I'm not hand-converting each issue. There are over 150 of them. I'm writing a script that automatically pulls the correct data from each issue. That process is easier when data for each issue is in the exact same, standard format, and harder when there are differences or the standard is broken.

In this case, the standard is XML, in the form of a convenient document.xml file right next to all the images. I've already mentioned another XML file, DocumentText.xml. While technically the same standard, DocumentText.xml is simple, looking like an extremely low-effort bundling of some plain text into the minimal structure that would allow the Flash app to access it easily. In contrast, document.xml is far more complex - multiple tag types, redundant attributes, confusion about when to use attributes or content. There is a clue near the top that I think hints at the creation process of these DVDs: a filename tag whose content is a filename ending in .pdf. I'd wager that most of the story of this DVD's creation is that a PDF of each issue was run through a tool that ripped it apart and re-wrapped it in a Flash app. I'll come back to this point later.

For now, there is also a title element in document.xml. Hooray! The only awkward thing is that it's more like the subtitle of the issue. “Vol 1 No 1 March 1985” says the first one. The name “American Woodworker” is nowhere to be found. This isn't a huge problem for my script (I know all of the issues are American Woodworker, so I'll just hardcode it), but it still seems odd. Each issue is packaged in such a way that it could be shipped on its own, yet except for the image and OCR text, nothing says what magazine it is.

Even stranger: there is no publishing date recorded anywhere! The closest to it is that month and year in the title. We all know that no magazine publishes its “March” issue in March. The system I wanted to use to re-export the data I'm importing really wanted a date attached to the content, though. So I attempted to parse this information out of the title, and learned that the titles are nowhere near regular. Here are examples of the seven types of titles I found:

  • Vol 1 No 1 March 1985
  • Vol 2 No 1 Spring 1986
  • Vol 4 No 1 March-April 1988
  • Vol 4 No 3 July-August1988 (ed: note the missing space)
  • No 12 February 1990
  • No 41 1995 Tool Buyers Guide
  • No 83 Tool Buyers Guide 2001

Volume, issue, month, and year. Season and year. Month range and year. Monthrangeandyear. Actually, forget the volume. The coming year for the tool guide (sequential issue numbers suggest it was published between the October and December issues). At either end of the title.

This sort of variation is no surprise to anyone that has managed data. New staff will choose different formats. The same staff will choose different formats year to year. Everyone will make mistakes even when trying to follow an agreed format. Ask anyone who uses the title “data scientist”, and they'll tell you - a large part of their time is not evaluating statistical models, it's cleaning up incoming data.

There is one final thing in document.xml that is a really nice extra: a table of contents. I don't have to rely on the OCR of the scan of the table of contents page - someone, somewhere along this production pipeline, created a structured, machine-readable version of it. The only problem is that it often contains invalid XML, and the tools I chose don't like that.

Some of the invalid XML is obvious. “Q&A” (which in many issues is actually “O&A” - thanks, OCR) can't be written that way. The ampersand is a special character in XML. It must be written Q&amp;A. A section listing the page images often begins with a comment - <!----> - but in one issue they attempted a comment within a comment - <!--<!---->-->. That's also not allowed.
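
Catching and patching these problems before parsing is doable with stock tools. A rough sketch (the sed patterns are illustrative, not the exact cleanup my script performs):

# xmllint (included with macOS) reports the first well-formedness error
% xmllint --noout document.xml

# Escape the bare ampersands before handing the file to a strict parser
# (in a sed replacement, "\&" is a literal ampersand)
% sed -e 's/Q&A/Q\&amp;A/g' -e 's/O&A/O\&amp;A/g' document.xml > document-fixed.xml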

But the one that really made me laugh came in Issue No. 42, December 1994. My terminal says someone typed “Maple and MahoganyÑStep by Step” as the entry name. They didn't, obviously. They typed something else where that capital-N-with-tilde is, and this is how my terminal interprets what their publishing software put there. My parser alerted me to the problem not because it's not what was originally typed, but because the byte representing that character in the file is invalid for the encoding this XML file specifies!

Right at the top, this file declared itself to be encoded in UTF-8. The character code that I'm seeing as Ñ, and that my XML parser rejected, is 209, or in hex D1, or in binary 1101 0001. In UTF-8, if a byte has 110. .... as its high-order bits, then the next byte must have 10.. .... as its high bits. The S in “Step” doesn't - it's decimal 83, hex 53, binary 0101 0011. I'm seeing this as Ñ because my terminal is doing its best to make up for this very common error, and showing the 209th Unicode code point (which is the same as the 209th Latin-1, or ISO8859-1, character). If the capital-N-with-tilde had been the actual character desired, and it had been properly UTF-8 encoded, it would have shown up as the two bytes: C3 91 in hex.
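
A handy way to find bytes like this without reading hex by hand: ask iconv to re-encode the file as UTF-8. That's a no-op for a valid file, but for this one it exits with an error at the first sequence it can't decode - the D1 followed by the S.

# Re-encoding UTF-8 as UTF-8 doubles as a validity check
% iconv -f UTF-8 -t UTF-8 document.xml > /dev/null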

If the 209/D1 isn't a UTF-8 encoded character, and it isn't Latin-1/ISO8859-1, then what encoding is it? I loaded up the scanned image to find out.

Maple and Mahogany–Step by Step
The table of contents entry, as it appeared on the page.

“Aha!” I thought, “an em dash!” What encoding has an em dash at 209/D1? None of them! But many of them have the em dash at 208/D0. Okay, what about en dash? Exactly one: macintosh. This really surprised me. I strongly suspected one of the oddball Windows encodings that are responsible for smileys showing up as capital J characters in emails. But, I suppose the ties between publishing and Mac are strong.
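
One way to run that check yourself is to feed the single byte through iconv for a few candidate encodings and look for e2 80 93 - the UTF-8 bytes for an en dash - in the output (encoding names are GNU iconv's; printf's \x escape needs a bash- or zsh-style shell):

# Which candidate encoding maps byte 0xD1 to an en dash?
% for enc in ISO8859-1 CP1252 MACINTOSH; do echo "== $enc"; printf '\xd1' | iconv -f $enc -t UTF-8 | hexdump -C; done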

The convenient iconv utility made quick work of converting all the text from “macintosh” encoding to actual UTF-8. But my XML parser still rejected the file. This time for its first character! What happened?

# Convert from macintosh encoding to UTF-8
% iconv -f MAC -t UTF-8 document.xml > document-utf8.xml

# Print the first line of the UTF-8 version
% head -1 document-utf8.xml
Ôªø<?xml version="1.0" encoding="utf-8"?>

# Look at the byte values of the first line of the UTF-8 version
% hexdump -C document-utf8.xml | head -1
00000000  c3 94 c2 aa c3 b8 3c 3f  78 6d 6c 20 76 65 72 73  |......<?xml vers|

What in the... That looks like a mess. What was there before conversion?

# Print the first line of the Macintosh version
% head -1 document.xml
<?xml version="1.0" encoding="utf-8"?>

# Look at the byte values of the Macintosh version
% hexdump -C document.xml | head -1
00000000  ef bb bf 3c 3f 78 6d 6c  20 76 65 72 73 69 6f 6e  |...<?xml version|

ef bb bf? Oh, I've seen this before. It's a byte order mark. Technically, it's not needed for a UTF-8 encoded file, but some use it as an indicator that a file is UTF-8. My XML and printing tools ignored/hid it like they're supposed to. But when I told iconv that the file was not UTF-8, it tried to interpret those first three bytes as macintosh characters, and re-encoded them as the equivalent UTF-8 characters. Since the byte order mark was unnecessary anyway, I just chopped it off before conversion.
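
The chop can be as simple as this (tail -c +4 starts reading at the fourth byte, skipping the three-byte mark):

# Skip the 3-byte BOM, then convert the rest from macintosh encoding
% tail -c +4 document.xml | iconv -f MAC -t UTF-8 > document-utf8.xml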

I said I laughed when I saw these character problems. I couldn't help it. It's something anyone who has dealt with computerized text in the last twenty years has learned. I remembered learning it myself. Unicode, UTF-8, Latin-1, and ASCII are indistinguishable for the first 128 characters. Unencoded Unicode (i.e. raw code point values) looks exactly like Latin-1 for 128 more. It's a very common mistake among European language speakers to assume this means UTF-8 is the same for that range as well. So many tools silently fix the glitch, and it's not until you're wondering why your strict XML parser is failing (or more likely why your emojis aren't rendering) that you learn what's up. It was fun to see the error using a non-Latin-1 encoding, and dust off this knowledge to fix it.

Aside

I also have to take a moment to mention the other walk down memory lane. I initially ripped an image of the first DVD to work with, but eventually went back to working directly with the DVD, for reasons. It has been years since I worked with non-solid-state storage. The pause after tapping tab to autocomplete a filename. The sound of the drive spinning up. Eject being something more than just safely disconnecting. These brought such nostalgia for hacking in decades gone by. The incremental readjustment to adapt to each issue's quirks hardly felt like a slog at all with such visceral reminders of the age that got me into computering.

Re-exporting

So after all that, I have a pile of images, a pile of text, and some information about where those images and text came from. What should I do with it all?

My first thought was, “Nothing.” If what I want to do with this content is search through it, there are few better formats than a pile of simple, plain files. I can use generic, built-in tools, like grep to search through the text and metadata, and Preview to read the page scans. I can easily reference any piece by writing down the path and filename.

But then I started thinking about how it might be nice to have a much more magazine-like experience, flipping through while relaxing on the couch holding my iPad. I know the Files app exists on that device, but it doesn't seem designed for the purpose. I started building a small website generator, thinking I'd host it on my local network, and just read in Safari. I got as far as an HTML page full of thumbnails before realizing I didn't really want to spend time creating UI. There are many book/document readers, and many book/document formats. Surely some existing solutions would suffice.

I considered repackaging the images into PDFs. I'm still annoyed that it's obvious the images came from PDFs, and that the PDFs themselves weren't included. Yes, we all shied away from PDFs in the late 90s when downloading one took forever and froze our browsers. Tech got better ... for reading anyway. Short of dusting off my college LaTeX skills, I didn't find an easy way to put all this content into new PDFs that didn't involve me manually clicking through a bunch of UI over and over.

I asked the internet, and it came up with a very promising suggestion: the CBR/CBZ format used by comic book and graphic novel collectors. It's just a ZIP full of images whose filenames are their page numbers in the book. Simple, quick ... and surprisingly poorly supported. Calibre will import them and convert them to other formats for you. The apps that would read them directly seem to have vanished - empty domains and broken store links. That's probably for the best (for me) because it also seems that the format never adopted any consistent metadata standard. It really was just the images.

I very nearly revived the DjVu format. Designed specifically for archiving scans! Lisp-based metadata! ... and even worse modern support than CBR, as far as I could tell. This one really felt like a shame, and I'm tempted to do a nerdy deep dive on the history of this format, because from a cursory glance, it looks like a format that could have been great.

EPUB

Where I finally landed was EPUB. I'm still not sure it's the right choice, and I am sure there are better ways to build it than what I've cobbled together. But my workflow has generated an EPUB for every issue on these DVDs, and Apple Books seems happy to display them. So, success?

Two resources helped immensely in putting this together. The first was the EPUB 3 Samples Project. The EPUB spec is your standard-issue, er, standard: “Here's all the things, good luck.” Good for reference, but not at all a guide. The samples project has a large selection of EPUB documents demonstrating many different styles of “book”. The ones I found most useful were three different versions of a graphic novel, where each page's content was entirely one large image. It's exactly what I needed to make, and why the other comic book format was so appealing.

I took that graphic novel example and modified it to display several pages of the first American Woodworker issue. I adjusted the title and styles, and got the cover art working. Once I understood where things belonged, I reworked my example into templates that could be filled with the data I extracted from each issue.

The second immensely helpful resource was the EPUBCheck tool. EPUB may be “basically a zipped up website”, but EPUB readers are in no way the developer aids that web browsers are. It only took one round of Apple Books giving a generic error message while EPUBCheck pointed directly to the problem. From then on, I ran EPUBCheck on newly generated issues before doing anything else.
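
There isn't much to the invocation; EPUBCheck is a Java tool, so depending on how it's installed this may be a wrapper script or a direct java -jar epubcheck.jar call (the filename below is just an example):

# Validate a freshly generated issue before trying to open it anywhere
% epubcheck 1985_01.epub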

To generate the EPUBs, I'm still using the static site generator that I started with when making a website full of images: Jekyll. If EPUB really is “just a zipped up website”, why not build that website in a way that I'm familiar with? (Jekyll is how I publish this blog.)

Jekyll is likely the wrong tool for this job. To its credit, it makes fitting the data I extracted from the DVDs into new XML/XHTML files easy. I created very simple YAML files for each issue (simple name: value pairs, one per line, if you're unfamiliar, e.g. title: Vol 1 No 1 March 1985), and a small set of templates that reference the values in those files (e.g. <dc:title>{{ title }}</dc:title> in my package document template). Building the Jekyll site runs each YAML file through its correct template and produces a fully rendered file (e.g. <dc:title>Vol 1 No 1 March 1985</dc:title>). Easy!

But there are three elements to each issue that do not need templating, and Jekyll makes each of them awkward in different ways.

The first is the images. When Jekyll writes out the rendered files, it does so in another directory. In order to get the images in the correct location relative to those renders, it has to copy them to that directory. The images are over 99% of the data here, and doubling their consumption is not great. It's also not awful, considering we're talking about six-ish gigabytes of data, and we all have hard drives far larger than that. The power user in me, who has had too much data for his disks in past years, still cringes though.

The second awkwardness is three files that are exactly the same for each issue. It feels silly to me to make copies of these files before rendering (so that Jekyll can copy them, like the images, to the right output directory). But Jekyll offers no built-in way to say, “Please duplicate these same three files to these N other locations during rendering.” To add that ability, I wrote a plugin that dynamically adds virtual files (objects that only exist in memory, while the rendering process is running) for each issue (EPUBStatic.rb). Those virtual files just happen to copy from the same source to different outputs. It's not a wildly complex plugin, but I found the documentation for Jekyll plugins to be pretty thin, and I've used Ruby very little, so this was more a matter of searching for examples and modifying them to fit my needs.

And finally, the third awkwardness is the ZIP. After all of the files are copied and rendered, the last stage of creating an EPUB is to bundle them all into a ZIP file. This is another operation that Jekyll doesn't support, so another plugin was needed (ZipEPUB.rb). But because of the way it has to hook into the very end of the build process to ensure that all of the other rendered files exist first, Jekyll doesn't know about its output (the ZIP/EPUB files) at the start of the build process when it's cleaning up files from previous builds that aren't represented in the source directories. Jekyll deletes all of the EPUBs every time. My workaround is to copy the files to Jekyll's cache directory right after creating them. Then I can copy them back from the cache directory to the build directory instead of re-zipping if the files that would have been zipped haven't changed. When it works, this saves me minutes out of a full-collection build. The tradeoff is that I have to store approximately three copies of each EPUB. That is, not just the ZIP and the cache copy I created, but also a copy that Jekyll tracks in an opaque .jekyll-metadata file. (Sometimes. As I write this, my .jekyll-metadata has magically shrunk from multiple gigabytes to just a few megabytes. I don't know why.) This is why the instructions in README.md suggest passing the --disable-disk-cache flag. If you're only building everything once, instead of iterating on improvements like I am, it saves up to 40% on disk space.
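
For reference, here's what that final packaging step looks like as a stand-alone shell sketch, independent of Jekyll (directory and file names are illustrative). The EPUB container spec requires the mimetype file to be the very first entry in the ZIP, stored without compression, which is why it gets added separately:

# Add mimetype first, stored (-0), with no extra file attributes (-X)
% cd _site/1985_01
% zip -X0 ../1985_01.epub mimetype

# Then add everything else, compressed
% zip -rX9 ../1985_01.epub META-INF OEBPS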

So, it required significant additions to my Jekyll setup to be able to create EPUBs with it. But also, it worked. I now have an EPUB for each issue. I've read a couple on my iPad. Why am I not sure if EPUB was the right choice?

It's hard to put my finger on exactly where my discomfort with EPUB is rooted. Some of it is that I keep getting this feeling that it was designed less as a way to archive and share books, and more as a way to transact with bookstores. To the extent that this helps authors get paid (i.e. streamlining the process from writing to selling), I'm all for it. To the extent that it helps libraries easily deliver material to their patrons, I celebrate it. But an effect this seems to have is that EPUB readers tend to focus on getting you connected to a source and helping you consume your book list. I have a bunch of content I already bought, with no specific goal to read it all. What's the right way to sync it, search it, and link to it for later reference? Maybe I just haven't found the right app to do it yet.

An example of this that is more core to EPUB itself, and not just its apps, is how limited the widely supported metadata is. I mentioned the trouble I had parsing out a publishing date from each issue's title. What I learned later is that there is no standard place to include such information in an EPUB. The date metadata field is specified (as of EPUB 3) as the date the EPUB was created. Despite many questions on various forums, the advice for adding information like this, about the source from which an EPUB is derived, is to add generic meta items. No apps I know of really care about those items, but you can put them there without being out of spec. To me it has the same flavor as Star Wars only meaning whatever the latest remaster is, and forgetting the original releases. It's a vague feeling that, again, might just be me being unfamiliar with the larger EPUB ecosystem.

Search is the thing I'm really going to have to work on. It would be great to not have to keep the raw extracted data once I have generated the EPUB files. But since they're compressed, simple text-search tools like grep don't work. There is zipgrep, but it has some quirks (like only being able to search one file at a time). At the same time, Apple Books doesn't seem to see the text that I put in the files at all. I added it all as alt attributes on the img tags. Do I need a different app, or do I need to put the text somewhere else on the page? The middle option is to keep the text files, but remove the images after EPUB creation. The text is less than 1% of the total storage, so it's not a huge burden.
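
The one-archive-at-a-time limitation is easy enough to paper over with a loop, at the cost of speed (filenames are illustrative):

# Search every EPUB in a directory, one archive at a time
% for f in *.epub; do echo "== $f"; zipgrep -i "outfeed table" "$f"; done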

Wrap up

Images of the covers of American Woodworker magazine issues from the 1980s arranged in a grid, each with a tiny 'NEW' label under it.
EPUBs loaded on my iPad

I now suspect that I have the only EPUB copies of American Woodworker 1985-2010, until someone else runs my scripts against their DVDs to make their own.

Alternatives

I considered whether there was another option to gain access to these back issues more easily. American Woodworker closed up shop just four years after producing these DVDs. But Popular Woodworking now sells a USB thumb drive with these issues, plus the remaining years, plus the back issues of three other magazines, for $180. They also sell individual issues for $6. I'm skeptical that these editions would be any more accessible to me. The single issues don't specify a file format, but the thumb drive specifically says, “NOT compatible with iPads, iPhones, tablets or smartphones.” If you have one of these thumb drives, I'd be curious to hear whether the files on it are structured similarly to these DVDs.

Another option was to set up a VM and install Flash on it. I'm sure I could find an old installer for something like Internet Explorer on the VM I keep around to access BMW motorcycle maintenance information. But that's so many hoops to jump through that I doubt I would have actually read any of the magazines.

Reading

Now that I have the content, I have been paging through some of the early bits. I'm enjoying seeing which things have and haven't changed. Table saw crosscut sleds, for instance, have evolved. If you've seen instructions in the past ten years, they recommend T-track, stops, extensions, zero-clearance inserts, and safety windows. This is the entirety of American Woodworker's instructions for building one, in 1985:

This type of a jig is best built from cabinet grade mediun density fiber board. Simply attach strips of wood which are milled to the table slots to the bottom of the piece of fiber board and bolt a stout piece of wood at either end to hold the whole thing together once it is cut in two. —American Woodworker, Vol 1 No 3 September 1985, page 10.

This appears to also be the era when people were choosing between table saws and radial arm saws. That has become a common topic again in the last few years, as many of the radial arm saws purchased then are now coming up for sale on the used market. Being far more familiar with table saw work myself, and knowing its dangers, I found it interesting to see it proposed that kickback during ripping on the table saw was far less concerning than kickback during ripping on the radial arm.

What remains to be seen is how much of my design taste will be altered by exposure to these issues. What I'm seeing in the '85 issues is very much a mix of what I think of as traditional and modern styles. The bench in one issue will be a rectangular box with ornate turned or carved legs and crossmembers. The bench in the next will be smooth flowing laminated curves.

Future

Outside of tweaks to improve searching or cataloging, I'm not sure I'll do much more with this code. I'm publishing it for reference, hoping it might help someone else. If you use it to recover another magazine's issues, please let me know! If it helps you get your own EPUB project off the ground, tell me what you created!

Categories: Woodworking Development