MinDocDigPro

From Minn-StF Wiki
Revision as of 06:01, 16 September 2009 by Dd-b (talk | contribs) (Give permission for CC-BY-SA for Minicon publication contributions.)

MinDocDigPro is The Minnstf Document Digitization Project. This project has two major goals:

  1. Preserve MNstf documents via digitization.
  2. Make MNstf documents widely available on the web.

There are two major challenges:

  1. Actually doing all the scanning
  2. Securing permission from copyright holders who contributed to the documents

And one minor challenge: Putting the documents up on the web in a reasonably organized and pleasing way.

Converting paper to computer files

Scanning

Scans should be high enough resolution that the smallest text is easily readable and artwork does not lose detail. As a rough guideline, use at least 300 dpi, or the highest resolution your scanner supports, as long as that does not make scanning excessively slow.

Scanning in color is a good idea, even with black and white content. This is because a lot of old printing technology has very poor contrast in blue, but much better contrast in red and green. So you can get a better result by working with the red channel alone than with a greyscale scan in which the three channels have been averaged.
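As a minimal sketch of the channel comparison described above, the following picks the channel with the widest spread between its darkest and lightest value and compares it with a plain average. The sample pixel values and the function names are made up for illustration, and max-minus-min is only a crude stand-in for contrast; on a real scan a percentile-based measure would be less sensitive to stray specks.

```python
def channel_contrast(pixels, channel):
    """Spread between the darkest and lightest value in one channel."""
    values = [p[channel] for p in pixels]
    return max(values) - min(values)


def best_channel(pixels):
    """Return the index (0=red, 1=green, 2=blue) with the widest spread."""
    return max(range(3), key=lambda c: channel_contrast(pixels, c))


def averaged_grey(pixels):
    """Plain average of the three channels, for comparison."""
    return [sum(p) // 3 for p in pixels]


# Hypothetical sample: faded bluish ink on yellowed paper, as (r, g, b).
paper = (235, 225, 190)
ink = (70, 80, 150)
sample = [paper, ink, paper, ink]

best = best_channel(sample)
single = [p[best] for p in sample]   # keep only the best channel
grey = averaged_grey(sample)

single_spread = max(single) - min(single)
grey_spread = max(grey) - min(grey)
```

With these made-up values the red channel wins, and its spread is noticeably wider than that of the averaged greyscale.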

Also make sure that your scanner is not saving images as JPEGs, since the lossy compression is poorly suited to text and line drawings (the majority of our content). Any lossless format, such as PNG or TIFF, should be fine.
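If you are not sure what your scanner actually saved, the file extension is not always trustworthy. Here is a small sketch that checks the first few bytes of a file against well-known format signatures. The function names and the sample file name are hypothetical. Note the limitation: some containers (TIFF and PDF among them) can hold JPEG-compressed data internally, and this check will not notice that.

```python
import os
import tempfile

# Leading "magic" bytes of a few common image formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"II*\x00", "tiff"),   # little-endian TIFF
    (b"MM\x00*", "tiff"),   # big-endian TIFF
    (b"BM", "bmp"),
]


def sniff_bytes(head):
    """Guess the format from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return "unknown"


def sniff_file(path):
    with open(path, "rb") as f:
        return sniff_bytes(f.read(8))


def is_jpeg(path):
    return sniff_file(path) == "jpeg"


# Demonstration with a fake file whose extension does not match its content.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page001.tif")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"\xff\xd8\xff\xe0" + b"\x00" * 16)  # starts like a JPEG
    mislabeled = is_jpeg(path)
```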

Processing

Here is a complete set of recommendations for how to take raw scans and make them into a set of documents for the web. You could easily consider doing all of this to be too much work for the payoff. Pick the subset that makes you happy. Bonus if the result of your work can easily be used to do the remainder of what is suggested here.

There are, more or less, four useful quality/detail levels that you can end up with:

Raw scans
Advantage: A person seeing these has the most confidence that she is seeing the document as it originally looked.
Disadvantage: Very large file sizes.
Cleaned-up high-quality scans
Advantage: Generally looks better than a raw scan.
Disadvantages: Still fairly large files. Some detail may be inadvertently lost.
Scans reduced to the smallest readable size
Advantage: Small file size while still showing all layout.
Disadvantages: Harder to read, artwork is washed out.
Transcribed/OCR'd text
Advantages: Very small file sizes, can be searched, indexed by search engines, read by blind people, etc.
Disadvantages: Nearly all formatting is lost. Relatively large amount of work to produce.

Raw scans

By "raw", I mean that the scans have not been rotated, had any colors altered, etc. However, they can (and should) be cropped to remove extra borders due to your scanner being bigger than the document. It is also nice to replace any remaining black areas outside the scanned page with flat black to reduce the file size.
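The cropping step can be sketched as finding the bounding box of everything lighter than the scanner background. This assumes the background scans as near-black (as mentioned above) and treats the image as a grid of greyscale values from 0 to 255; the `dark` cut-off of 40 and the function names are arbitrary choices for illustration.

```python
def page_bounds(grid, dark=40):
    """Return (left, top, right, bottom) of the area lighter than `dark`."""
    rows = [y for y, row in enumerate(grid) if any(v > dark for v in row)]
    cols = [x for x in range(len(grid[0])) if any(row[x] > dark for row in grid)]
    if not rows:
        return None  # nothing but scanner background
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def crop(grid, box):
    """Cut the grid down to the box; pixel values are left untouched."""
    left, top, right, bottom = box
    return [row[left:right] for row in grid[top:bottom]]


# Hypothetical 5x4 scan: a light page surrounded by dark scanner background.
scan = [
    [5,   8,   6,   7, 4],
    [6, 240, 235,  30, 5],
    [7, 238,  20, 236, 6],
    [4,   6,   5,   8, 7],
]
box = page_bounds(scan)
page = crop(scan, box)
```

Dark ink inside the page (the 30 and 20 above) survives because the box is drawn around the light paper, not around individual pixels.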

Cleaned-up high-quality scans

To produce these, I first rotate the image to straighten it out. Usually my scans are off-square by ~0.5 degrees. This sounds like a very small amount, but it is easily visible.

Then, assuming the document is not in color (two-tone or greyscale), I take the best color channel (generally red or green) and throw away the others. Defining the background color as "white" and the foreground color as "black", I make all very light pixels white and all very dark pixels black. The goal of the first (white) part here is to avoid storing the grain of the paper, flecks of dust, and bleed-through from the other side of the page. The second part makes the text look better while also making the file smaller. If the document is two-tone (i.e. no greyscale), I push this fairly hard, but I leave some gradient between black and white so that the edges of characters are not too sharp. If the document has greyscale art or photos, I just do enough to remove the noise. (Sometimes I even get obsessive and do different parts of the same page different ways.)
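The "very light becomes white, very dark becomes black" step is essentially a levels adjustment. Here is a rough sketch on a single row of greyscale values. The black and white points (60 and 220) are invented for the example; in practice they are picked by eye per document, and nothing here is meant to describe the exact tool or settings used for the existing scans.

```python
def apply_levels(values, black_point, white_point):
    """Clamp to pure black/white outside the two points, stretch linearly between."""
    if not 0 <= black_point < white_point <= 255:
        raise ValueError("need 0 <= black_point < white_point <= 255")
    span = white_point - black_point
    out = []
    for v in values:
        if v <= black_point:
            out.append(0)        # very dark: pure black
        elif v >= white_point:
            out.append(255)      # very light: pure white
        else:
            # Keep a gradient in between so character edges stay soft.
            out.append((v - black_point) * 255 // span)
    return out


# Hypothetical row: paper grain on the left, the edge of a letter, then ink.
row = [250, 238, 221, 140, 60, 22, 8]
cleaned = apply_levels(row, black_point=60, white_point=220)
```

Moving the two points closer together corresponds to "pushing it hard" for two-tone documents; moving them apart is the gentler setting for pages with greyscale art.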

If there is a block of black (or similar) that extends all the way to the edge of the page in the original, I'll make sure this is duplicated in the image. Usually the scanning adds a border or chops a bit off in this case.

If the document was two-tone, but not black and white (e.g. yellow paper with bluish text), I convert back to these colors.

In the end, this version should have a much smaller file size than the raw scans, while also looking a lot better.

Scans reduced to the smallest readable size

To produce these, I reduce the size of the scans and convert them to 1-bit color (i.e. black and white, but it could also be yellow and blue at the same file size if you save it correctly).

To do this, I find the smallest text in the document (or the smallest text that deserves being saved; sometimes preserving every detail of advertisements gets silly) and see how small I can shrink it before it is unreadable when converted to 1-bit color. I ruthlessly destroy greyscale artwork in the process. People who want this should get the higher quality version. (Well, sometimes I apply a different threshold to the text and the art in order to get as good a quality as can be had, but it's not gonna be great.)
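A toy version of the shrink-then-threshold step, working on a grid of greyscale values. The integer block averaging, the thresholds and the sample grid are all assumptions for illustration; a real image editor will resample more carefully. The output uses 1 for ink and 0 for paper.

```python
def downscale(grid, factor):
    """Shrink by an integer factor, averaging each factor x factor block."""
    height = len(grid) // factor
    width = len(grid[0]) // factor
    small = []
    for by in range(height):
        row = []
        for bx in range(width):
            block = [
                grid[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) // len(block))
        small.append(row)
    return small


def to_one_bit(grid, threshold=128):
    """1 = ink (darker than threshold), 0 = paper."""
    return [[1 if v < threshold else 0 for v in row] for row in grid]


# Hypothetical 4x4 patch: a solid block at top right, a thin stroke below it.
patch = [
    [255, 255,  0,   0],
    [255, 255,  0,   0],
    [255, 200, 90, 255],
    [255, 255, 60, 255],
]
small = downscale(patch, 2)
default_bits = to_one_bit(small)                 # thin stroke disappears
tuned_bits = to_one_bit(small, threshold=170)    # thin stroke survives
```

The thin stroke averages out to a light grey and vanishes at the default threshold, which is why it is worth checking the smallest text by eye and adjusting the threshold (or the amount of shrinking) before committing.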

Transcribed/OCR'd text

placeholder

File Formats

Scans of book-type documents

PDF seems to be here to stay, at least for this decade or so. It is (nearly) universally readable with very few problems. It is probably the best format to offer on the web.

The images that make up a PDF are fairly easy to extract with no loss of quality if later someone wants to work on them or convert the document to a different format. Nevertheless, for long term archival, it may be useful to explicitly store the images as separate files so people stumbling on them in 30 years don't need to have a PDF reader. For really long term archival, it may be useful to store the image files in portable anymap (PNM) format, a trivial-to-read well-documented uncompressed file type. For extremely long term archival, please engrave on platinum tablets.
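To give a sense of how simple PNM is, here is a sketch that writes and reads the binary greyscale variant (PGM, magic number "P5"): a short text header followed by one byte per pixel. The reader is deliberately minimal and assumes a header without "#" comment lines and a maximum value of 255, so it is not a full implementation of the format.

```python
def encode_pgm(grid):
    """Serialize a grid of 0-255 values as a binary PGM (P5) file."""
    height = len(grid)
    width = len(grid[0])
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    body = bytes(v for row in grid for v in row)
    return header + body


def decode_pgm(data):
    """Minimal P5 reader: no comment lines, maxval 255 only."""
    tokens = []
    pos = 0
    while len(tokens) < 4:
        while data[pos:pos + 1].isspace():          # skip separators
            pos += 1
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])
    pos += 1  # exactly one whitespace byte separates header from pixels
    magic, width, height, maxval = tokens[0], int(tokens[1]), int(tokens[2]), int(tokens[3])
    if magic != b"P5" or maxval != 255:
        raise ValueError("only 8-bit binary PGM is handled in this sketch")
    pixels = data[pos:pos + width * height]
    return [list(pixels[y * width:(y + 1) * width]) for y in range(height)]


# Pixel values 32 and 10 are the bytes for space and newline, so this
# also checks that the reader does not mistake pixel data for whitespace.
grid = [[32, 10, 255], [0, 128, 9]]
data = encode_pgm(grid)
restored = decode_pgm(data)
```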

Scans of single-page documents

For these (flyers, perhaps), I'd be inclined to just offer the image itself.

Transcribed text

Plain ASCII text is the most universally readable format. It is also, generally, very easy to produce. I discourage trying to reproduce the format of each page with spaces and/or tabs. This often makes the document nonsensical to anything (person or program) that reads linearly through the file. If people want the formatting, they should go to the scans, I'd say.

As a slight variant, you could use plain Unicode text. This allows for things like long dashes (—) rather than approximations such as double hyphens (--). Encoded in UTF-8, it is byte-for-byte the same as plain ASCII except for the non-ASCII characters. (UTF-16 is not: it uses at least two bytes for every character.) Please avoid saving in any extended-ASCII format such as Windows-1252 (or even ISO 8859-1), since to be interpreted correctly later, the character set used needs to be explicitly stated or guessed heuristically. If it is guessed wrong, you get all sorts of garbage, such as accented vowels when you meant a closing double quote. If you don't know how to tell what text format you are using, please see the Internet or an expert.
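The encoding pitfalls are easy to reproduce. This sketch uses only Python's built-in codecs; the sample strings are arbitrary, and code page 437 is picked as just one example of a wrong guess.

```python
ascii_only = "Minicon 4 program book"

# Pure ASCII text is byte-for-byte identical when saved as UTF-8 ...
same_as_ascii = ascii_only.encode("utf-8") == ascii_only.encode("ascii")

# ... but not as UTF-16, which spends two bytes on each of these characters.
utf16_length = len(ascii_only.encode("utf-16-le"))

# A closing double quote saved as Windows-1252 is the single byte 0x94.
closing_quote = "\u201d".encode("cp1252")

# Read back with the wrong guess (here code page 437), it turns into an
# accented vowel.
misread_quote = closing_quote.decode("cp437")

# The reverse mistake: a UTF-8 long dash read as Windows-1252 becomes
# three junk characters.
misread_dash = "\u2014".encode("utf-8").decode("cp1252")

# Decoding with the same encoding that was used for saving is lossless.
original = "\u201cyes\u201d \u2014 twice"
round_trip = original.encode("utf-8").decode("utf-8")
```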

With substantially more work, you could produce an HTML version, which would allow for formatting without breaking up the structure of the text. I don't feel this is worth the effort, but have at it if it amuses you.

With even more substantial work, you could produce a PDF version with actual text in it instead of images of text that approximates the look of the original document. Might be a fun challenge, but not really practical.

Copyright

Most people who have contributed to MNstf publications probably did not intend to hold copyright on their contributions. However, some certainly did and we need to be careful about this. The Minnstf board in 2008 came up with clear guidelines on how to handle copyright for Mnstf (including Minicon) publications:

  • Images credited to an artist require us to ask the artist for permission to republish (including web posting)
  • Text credited to someone, ditto.
  • Anything not credited is assumed to be property of Minn-stf. We will, in general, put this up on the web with some sort of permissive license.

I (Matt Strait) have found some corner cases and decided to rule on them unilaterally until and unless someone complains. To wit:

  • I will take the "note from the chair" or similar to automatically be MNstf property even if directly credited to someone. I feel that we are on firm ground assuming ownership here in a way that we are not for things like GoH bios.
  • Text explaining where and when some event is or other similar trivial text I consider to be automatically MNstf property even if attributed to a department head. Text explaining a department's policies, procedures, etc. I also consider in this category as long as it is pretty dry. Highly creative text that nevertheless fulfills the same role should probably be checked on.

What we'd like

In decreasing order of preference, we'd like copyright holders to:

  1. Grant Minn-stf all rights to their works. This is best because then the work is Mnstf's and we can do whatever we want with it.
  2. Grant Minn-stf the right to make their works available via some sort of permissive license, such as those from Creative Commons. In this case, we need to know which sorts of restrictions the copyright holder is interested in, such as:
    • Requiring that the work cannot be used commercially
    • Requiring that the work cannot be modified
    • Requiring that attribution always be given when the work is shown
  3. Grant Minn-stf the right to display their works on the web while retaining full copyright restrictions.

And, orthogonally to that:

  1. Grant these rights to all of their works, past and future
  2. Grant these rights to all of their currently existing works
  3. Grant these rights to the one work that we just digitized

When contacting copyright holders, the ideal thing is to give them all of these options and see what the best we can get is. However, use your judgment. For instance, it's not always a good idea to send someone a really long e-mail with lots of complicated copyright questions in it. We'd much rather be granted just web-publishing rights on a single document than never get a reply at all.

Copyright answers

List copyright holders' responses in alphabetical order by last name. Check this list before sending any queries about the document you are working on. Notes:

  1. Some people here may not actually have any of their works in any Mnstf publications, but have given us blanket permission just in case.
  2. Please distinguish between the case of a copyright holder answering a question (e.g. "can we distribute your stuff under a Creative Commons license?") with "no" and the case where they did not answer the question at all (possibly because they weren't asked).
  3. If relevant, give the approximate date that the person made their statement. It is not generally safe to assume that a statement applies to future works unless that was stated.
  • Jennifer "Seven" Anderson: Grants us all rights for submitted works past and future.
  • Karen Cooper (late 2008): Ok with web posting. No answer about CC licensing.
  • DDB (late 2008): Ok with web posting. Answer pending about CC licensing. Late 2009: Okay with CC-BY-SA for my stuff in Minicon publications.
  • Ken Fletcher (late 2008): Ok with web posting. Answer pending about CC licensing.
  • Deb Geisler (late 2008): Any Creative Commons license is ok, "no worries".
  • Jeanne Gomoll (Wiscon 2009): Grants MNstf the rights to "anything else of [hers that] we run across" so long as we preserve attribution.
  • Kathy (Marschall) Grantham (13 May 2009): "fine to post any old Minicon programs with my art".
  • Eric Heideman (31 July 2009): We are free to republish his works as long as he is also free to republish them.
  • Carol Kennedy (late 2008): Grants us all rights.
  • Greg Ketter (late 2008): Grants us all rights.
  • James Kuehl (late 2008): Ok with web posting. No answer about CC licensing.
  • Jason Malgren (late 2008): Grants us all rights.
  • Sue Mason (late 2008): For the Minicon 38 chapbook, Creative Commons is fine, as long as people aren't using her work for profit.
  • Dave Romm (14 July 2009): Grants "Minicon and/or MN-StF the rights to use any of my work for Minicon and/or MN-StF purposes"; "A Creative Commons license would be okay", but it must require attribution and not allow commercial use.
  • Laramie Sasseville (late 2008): Ok with web posting, ok with CC license. Would appreciate a link to dreamspell.net.
  • Matthew Strait: Grants us all rights for submitted works past and future.
  • Geri Sullivan (late 2008): Ok with web posting, ok with CC license.

Web posting

To get a document on the web, you have to be some sort of Mnstf webmaster. Currently:

  • Matt and Dorf can edit the Minicon 42-45 web sites as well as the document repository pages for Minicons 1-29.
  • Kevin and Laurel have super powers and can edit any Minnstf page, including the above and the Minicon 30-41 web sites.

Currently here's where things are:

The main Minicon page says, before its link to the page for all older Minicons, "We didn't have websites for Minicons prior to Minicon 30 or at least we can't seem to find the archives right now." It is possible and reasonable that Minicon 29 (1994) might have had a website. It is even vaguely possible that there was a web page for Minicon 28 (1993), although it would have had to be set up before CERN made the WWW officially free technology, or for Minicon 27 (1992), although it would have had to have been hosted at CERN or on one of the very first webservers in the US, or for Minicon 26 (1991), although Tim Berners-Lee would have had to have written it.

Minicons before 30 might also have had things analogous to websites for the time, such as Gopher sites or collections of documents available via FTP or on BBSes. This is true for all Minicons starting with Minicon 3, before which no packet-switched network existed (the first two nodes of ARPANET being connected on 21 November 1969).

Anyway, the point of all of this is to say that this project is also interested in any electronic-only publications or pseudo-publications that might turn up from Minicons before 30.

Work reserved

Note here what you intend to work on. This prevents duplication of effort.

  • Matt is working on scanning the Minicon 24 program book.

Work completed

Say whether (1) the document is digitized, (2) the document has cleared copyright hurdles and can be put on the web, and (3) the document is actually up on the web now.

Minicon program books

Here's a table for an overview, followed by details.

#   know of a copy  scanned  processed  copyright ok  posted
1   ?
2   No
3   Yes
4   Yes             Yes      Yes        Partial       Partial
5   Yes             Yes      Yes        Partial       Partial
6   No
7   No
8   Yes
9   Yes
10  Yes
11
12
13
15
16  Yes
17
14  Yes             Yes      Yes        Partial       Partial
19
18
20  Yes             Yes      Yes        Partial       Partial
21  Yes             Yes      Yes        Partial       Partial
22
23
24  Yes             Yes
25
26  Yes
27
28
29
30
31
32  Yes
33  Yes             N/A      N/A        Yes           Yes
34
35
36  Yes
37  Yes             N/A      N/A        Yes           Yes
38  Yes             Yes      Yes
39  Yes             Yes      Yes
40  Yes             Yes      Yes
41  Yes             Yes      Yes
42  Yes             N/A      N/A        Yes           Yes
43  Yes             N/A      N/A        Yes           Yes
44  Yes             N/A      N/A        Yes           Yes

1-13

  • A flyer advertising Minicon 1 has been digitized, transcribed and posted on the web. This may be the closest thing to a program book. We don't know of any other documents associated with Minicon 1, although it's possible there were others. It was a 4.5 hour con.
  • Minicon 2 probably had a program book since this was a full-weekend convention, but who knows? For now, a flyer advertising it is the closest thing we know of. It has been digitized and transcribed. The transcription has been posted to the web. We need to check with Jim Odbert about his art.
  • The Minicon 3 program book hasn't been scanned, but here's what it needs by way of copyright checks:
    • Tom Foster for the cover art
  • The Minicon 4 program book is digitized, transcribed, and the text-version is posted on the web. To post complete scans, we need to copyright check art by:
    • Jim Odbert
    • Jack Gaughan
    • Jim Schumeister
  • The Minicon 5 program book is digitized and posted on the web (although not yet linked to).
    • The cover art is by Tom Foster and needs to be copyright checked. Until then, it is on the web with this removed.
  • The Minicon 6 program book is not in the archives. Do you have a copy?
  • The Minicon 7 program book is not in the archives. Do you have a copy?
  • The Minicon 8 program book is not scanned; here are the people who we would need consent from to post it on the web in its entirety:
    • Kelly Freas (art)
    • Jim Odbert (art)
  • The Minicon 9 program book is not scanned; here are the people who we would need consent from to post it on the web in its entirety:
    • Waller (? going from the signature) (art)
  • The Minicon 10 program book is not scanned; here are the people who we would need consent from to post it on the web in its entirety:
    • Gordon Dickson [a challenge]
    • Denny Lien

15-17, 14, 19, 18

  • The Minicon 16 program book is not scanned, but here's a list of people who would need to consent for a full web posting:
    • RayAllard [who might have a space in his name, but that's not how it appears]
    • J.J. Mars (art)
    • Lee Pelton (bio)
    • Del Monie (? reading the signature) (art)
    • Jim Young (bio)
    • Dave Wixon (bio)
    • Frank Stodolka (bio)
    • Fred Haskell (bio)
  • The Minicon 14 program book is digitized and posted on the web (although not yet linked to).
    • We do not have sufficient rights to allow Creative Commons licensing of the book in its entirety [DETAILS NEEDED]

20-26

  • The Minicon 20 program book is digitized and an incomplete version is posted on the web (although not yet linked to). Copyright answers are needed for:
    • James P Hogan GoH text attributed to Gerri Balter and Herman Schouten
    • Gafia text attributed to Jeanne Mealy
    • Stu Shiffman art on page 21
    • Foote the Hermit art on page 10
    • "Exercizes you can do at Minicon" by Matthew B Tepper
    • Kara Dalkey and Jerry Stearns on each other as toastmasters
    • "Patronizing the Arts" by Kara Dalkey
    • "A Modest Proposal" by Marianne Hageman
    • Yike. We're never going to find all of those people. Maybe we should just post the cover and programming schedule...
  • The Minicon 21 program book is digitized and an incomplete version is posted on the web. We need copyright answers for:
    • "Thrilling Wombat Stories" on page 13ff by Terry A Garey
    • "I Have Always Counted on the Consuite of Strangers" by David Charles Cummer, page 18
    • Diane Duane text by Pamela Dean on page 28
    • Terri Windling text by Will Shetterly on page 30
    • John M Ford text on page 30 by Emma Bull
    • PC Hodgell text on page 31 by Eleanor Arnason
    • Phyllis Eisenstein text on page 31 by Scott Imes (hrm...)
    • Art on page 7 by Steve Fastner
    • Art on page 13 by Kara Dalkey
    • Art on page 17 by Reed Waller
    • Art on page 35 by Foote the Hermit
    • Art on page 37 by Jim Odbert
    • Art on back cover by Rich Larson
    • see comment on Minicon 20 book above...
  • Minicons 22-23 go here
  • The Minicon 24 program book is scanned, but not yet processed.
  • The Minicon 26 program book is not scanned, but here's the list of people that would still need to give their consent for a full web posting:
    • Fred Levy Haskell (photo)
    • Jeff Schalles (bio)
    • Stu Shiffman (art)

27-33

  • The Minicon 32 program book is not scanned, but here's a list of contributors that would still need to be checked before a full web posting:
    • Rob Berry
    • Thomas Juntunen
    • Loren Botner
    • Phyllis Eisenstein
    • Jerry Stearns
    • Greg Johnson
    • Kathy Routliffe
    • Tappan King and/or Beth Meacham
    • Jason Parker
  • The Minicon 33 program book (or souvenir book, really) is on the web. However, the ads are overlaid with "F.P.O.", the color cover is rendered in black and white, and it is generally low quality.

34-44

  • The Minicon 36 program book is not scanned, and it is absurdly large, but here's the list of copyright checks that it would need:
    • Judie Cilcain (art and text)
    • Todd Cameron Hamilton
    • Katya Reimann (art and text)
    • Glenn Tenhoff
    • Arthur Thompson (ATom) [that is verbatim how the name appears in the credits]
    • Charles Urbach
    • Kip Williams
    • Ken MacLeod
    • David Owen-Cruise
    • Jo Walton
    • Neil Rest
    • Leslie Fish
    • Irene Raun
    • Avram Grumer
    • Laurel Winter
    • Graydon Saunders
    • Alison Scott
    • Bruce Schneier
    • James Nicol
    • Bill Higgins
    • Rachael Lininger
    • Fred A Levy Haskell
    • Gordon Dickson [going to be tough]
    • Phyllis Eisenstein
    • Mary Kay Kare
    • John M Ford [going to be tough]
    • Ben Bova
    • Joel Rosenberg
    • Dave Wixon
    • Peter Hentges
  • The Minicon 37 program book is on the web.
  • The Minicon 38 program book is digitized, but held up by copyright.
  • The Minicon 39 program book is digitized, but held up by copyright.
  • The Minicon 40 program book is digitized, but held up by copyright.
  • The Minicon 41 program book is digitized, but held up by copyright.
  • The Minicon 42 program book is on the web.
  • The Minicon 43 program book is on the web.
  • The Minicon 44 program book is on the web.

Minicon pocket programs

  • The Minicon 33 pocket program is on the web, but:
    • The outer pages are in an oddly scrambled order, no doubt due to how it was printed (the inner pages are in reading order).
    • The CONvergence ad is overlaid with "F.P.O.", which is not how it appears in print.
  • No pocket programs from Minicons 34-37 are on the web
  • The Minicon 38 pocket program is digitized, but has art that needs a copyright query.
  • The Minicon 39 pocket program is digitized and ready to be posted on the web.
  • The Minicon 43 pocket program, minus Wayne Barlowe's art, is on the web.
  • The Minicon 44 pocket program is on the web.

Minicon progress reports

  • The Minicon 31 PR1 and PR2 are on the web, although only in text format and not as mailed.
  • Five issues of Minicon 32's Minicon Monthly, which I guess filled the PR niche, are on the web. They imply the existence of at least one more issue which is not on the web.
  • No Minicon 33 PRs are on the web
  • The Minicon 34 PR2 and PR3 are on the web, although in text format and not as mailed, and most of the images are broken.
  • The Minicon 35 PR1 and PR2 are on the web. PR2 is in 9 separate PDFs; it would be nice to merge them.
  • The Minicon 36 PR is on the web. Also the Minicon 36 flyer, which seems rather PRish.
  • The Minicon 37 PR1 and PR2 are on the web.
  • The Minicon 38 PR2 and PR3 are on the web.
  • The Minicon 39 PR2 is on the web. The pages are in print-and-fold order; it would be nice to have them in a view-it-on-the-web order.
  • No Minicon 40 PRs are on the web.
  • No Minicon 41 PRs are on the web.
  • The Minicon 42 PR is not on the web.
  • The Minicon 43 PR1 is on the web, PR2 is not.
  • The Minicon 44 PR1 is on the web, PR2 is not.

Bozo Bus Tribune (Minicon at-con newsletter)

  • All 4 issues of the Minicon 43 Bozo Bus Tribune are on the web
  • All 4 issues of the Minicon 44 Medallion Hunt Bulletin (with Bozo Bug Tribune) are on the web

Minicon chapbooks, etc.

  • The Minicon 38 Sue Mason chapbook is on the web.
  • The Minicon 39 Deb Geisler chapbook is on the web.

Rune

None so far.