Tuesday, 30 March 2010


Paper is good stuff. Hurray for paper! On the other hand, it weighs quite a lot and takes up space. Also a big heap of paper is hard to manage in the long term, unless you are excellent at filing. Plus can I take all my files with me into a manuscript library? No, I cannot. But I can take this: [stock photo]

It's my little 500GB hard drive! (If you want one you can get a commemorative "Michael Jackson is dead" version for less money than the plain variety.) I am quite pleased with how I'm getting on with making this change, so I thought I'd write a blog post about it, partly as an exercise in self-satisfaction, but also just in case anyone reading wants to make a similar transition. Of course the stuff I work on imposes a few odd constraints on how I work.

1. Managing references.
A long time ago I got fed up of retyping the same bibliographical references, and made a file called "Megabibliography" which was just an amalgamation of all the bibliographies I had ever written, for reference purposes. They were in variable styles, of course, but it worked as a quickish way of getting references. I switched to Endnote in 2003 or 4 when the whole thing got out of hand, and I had a several-thousand-item bibliography I had to manage for work, and also needed to generate an annual bibliography in a very odd style. (Importing all the data gave me RSI.)
The things which Endnote does which I really need are: ability to tag documents with keywords (e.g. "Calendars", "ASE 2006", or "ASE check", etc); easily customisable export styles, allowing me to produce my own style sheets for obscure journals, and XML material for easy import into other things; and it quickly generates citations in a way which I can edit. The last is vital: most reference software is designed for scientists, and is based on the idea that your computer does all the work for you. But in the humanities a reference manager is essentially a time-saving device. It won't be able to cope with certain oddities, like nineteenth-century German monographs which are also issues of a journal, etc, and it's important that you can easily edit what comes out. Beware of Zotero and Mendeley. They may be OK if you're just starting out, and they are free, but they do things automatically for you which you don't necessarily want done, and are very inflexible. Also I am intensely suspicious of the cloud. It's just not true that I am rarely offline; sometimes I'm on a train, or in Corpus Christi College, where there is no wifi at all. Occasionally broadband connections go down. And I'm not a fan of their social-networking-style features. The other day Mendeley told me that the most popular author in the Humanities is Foucault, and suggested that I might like to read Foucault. As many as 15 Mendeley users have Foucault in their bibliographies, apparently. Stupid software. That sort of thing is a bit like the Ask the Audience option on Who Wants To Be A Millionaire, in that it's most useful for the questions about soap operas. I can quite happily believe I'll discover a fun YouTube clip or a good TV show through Web 2.0/Cloud-style stuff; not so sure I'm going to find help in my work on late Anglo-Saxon liturgy.
Be cautious also about the option to import references automatically into your library. The website of one of those applications says something like "Avoid typing errors by importing references from web databases", but what it means is "Propagate other people's errors by importing references from web databases". Library catalogues are packed with mistakes, because a lot of libraries mistakenly treat cataloguing as a data-entry task and pay casual staff not much money to do it. Import material as a time saver, but you'll need to edit it before moving on.

2. Recycling my "Photocopied Papers" folders.
All my folders of photocopied papers have now been recycled into rat bedding or taken away by the council, and I gave the lever arch files to my dad. The key to this was a small sheet-feeding scanner, and DVD box sets. The DVD box sets are vital because this is dull work in the same way that feeding CDs into iTunes is dull. I recommend the complete Seinfeld; it's light-hearted and you've probably seen them all before anyway. I bought a little scanner which does one sheet at a time quite quickly and feeds it through for you. Plus points: it's cheap; it's very portable; it came with excellent software. It was probably my favourite thing about 2009. (It wasn't a great year.) But if I were seriously rich I would have spent about 300 quid on this superfast non-portable sheet-feeding scanner which can do double-sided. Really it's the software which made the whole thing possible. Feeding paper into a scanner is fine because you don't need to think about it. Then, once I had done a whole article, I simply highlighted all the images from it in Presto PageManager, which came with the scanner, right-clicked to stack them, and then dragged them onto a PDF icon to save as a multi-page PDF. Whilst doing this I was beginning to feed the next article through (unless I had got too distracted by the antics of that Kramer). So if you have a scanner but no software then look into something which will painlessly turn images into pdfs for you.

3. Metamanuscripts.
I work on medieval manuscripts. I pay a lot of attention to form as well as content. I have not yet managed to find a way of taking notes on a computer which is better for my work than paper and pencil, because I need sometimes to make drawings of odd shapes of letters, or of decoration. I'm not an artist by a long long way but I can copy things OKish. I've tried using electronic pen input, with a graphics tablet, but it's not as controllable as a real pencil in terms of shading and such like; and some libraries let you take your own photographs, but not all do, and besides sometimes you need to copy something in order to look at it properly, or to add notes on things you can't expect to come out in a photograph, like words scratched into gold on initials. So I have a lot of manuscripts about manuscripts. I digitised them in the same way as the articles of step 2, but I haven't recycled them just in case, since they are unique; the originals sit in a couple of large boxes full of suspension files. I have added the PDFs to my MS-image files, and now if I'm sitting in a library looking at a manuscript and it reminds me of another manuscript I can immediately call up all my notes on it and any images I have just like that. This is an immensely useful tool for me. I continue to take notes with a pencil on paper, and then scan them in when I get home.
NB The graphics tablet might not have worked for me for manuscript notes, but it was great for the RSI caused by step 1. Hurray! Also these gloves, without which I cannot now type; I recommend them heartily if you have back-of-the-hand style RSI like me, rather than bad wrists.

4. Photocopying costs 10p a sheet at CUL, and 20p at the BL (where you are not allowed to copy more than a single page at a time, i.e. no double-page spreads).
I ended up in a situation where I would carefully (and expensively) photocopy an article, take it home, scan it, and then put the paper straight into the recycling. My current project is an attempt to cut out the middle phase of this, thus saving money, time, and the planet. I have acquired a very cheap, portable, USB-powered flat-bed scanner, and I am experimenting with scanning things from books. For example I have some multi-author books with one or two seriously useful articles in, and I am trying out digitising these so that I can take them with me to libraries. So far it's working well. So if possible in future I will borrow books or journals and copy the article I want straight on to my computer, and the portable scanner can live in my suitcase when I travel. I know that the UL photocopiers will now, theoretically, scan things for you and e-mail them to you for 8p (!) an image, but when I tried it, it charged me for a whole article and only sent me the first page, meaning I had to go back to the library and find the journal again, etc. I might give this another go sometime though, for things I can't borrow.

5. Image and text
I didn't originally OCR my scans. This was because at work I used to have access to the full copy of Adobe Acrobat, which would OCR pdfs for you, but very very slowly, without much accuracy, and with a large increase in the size of the file, so I wrote it off as a concept. However, I have recently got hold of ABBYY Finereader 10 and it is very fine indeed. It even OCRs in Latin. If I run pdfs through it which I scanned in from photocopies of open books, so that they have two pages side by side on a landscape sheet of A4, it automatically splits these into two separate pages and rotates them before doing the OCR. It's also possible to scan straight into Finereader; it does a lot of image processing though, and I think it's quicker to give it pdfs which it doesn't feel the need to tidy up so much.

6. Managing PDFs.
The point of having a computer-read text is to be able to search it. Of course the OCR isn't perfect (though it's very good) and I'm not wasting time making it perfect, so searches have to be reasonably fuzzy. It's at this point that I'm still experimenting. I can set Windows' own indexing facility to include pdfs (using the pdf iFilter which comes with Windows 7 at least) and then I can look for, say, "Harley" in all pdfs in a folder, and it returns a list of matches. However, it doesn't give me context -- it would be useful to have the several words before and after, especially since I am probably most interested in the number that follows Harley. Or I can use the "search all in folder" option in Adobe Reader 9, which I think will be useful. Ideally I'd like to produce a concordance file, but concordance software seems on the whole to need .txt files. If I could find a good piece of software that allowed me to insert my own index tags in pdfs and then produce an index across many pdf files I would have a go at making manuscript indexes for myself. These would be very very useful. I have manually indexed important articles in the past, and it's tedious work, but wonderful to have later.
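The kind of context search wished for here is usually called keyword-in-context (KWIC). A minimal sketch in Python, assuming the OCR text has already been saved as plain text; the function name keyword_in_context and its width and cutoff parameters are hypothetical, invented for this illustration, and the fuzzy matching uses the standard library's difflib rather than anything the tools above provide:

```python
import re
from difflib import SequenceMatcher


def keyword_in_context(text, keyword, width=3, cutoff=1.0):
    """Return each hit for `keyword` with `width` words either side.

    cutoff=1.0 demands an exact (case-insensitive) match; a lower value
    such as 0.8 tolerates small OCR slips like "Harlev" for "Harley".
    """
    words = text.split()
    target = keyword.lower()
    hits = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation so "Harley," still counts as "Harley".
        bare = re.sub(r"^\W+|\W+$", "", word).lower()
        if SequenceMatcher(None, bare, target).ratio() >= cutoff:
            start = max(0, i - width)
            hits.append(" ".join(words[start:i + width + 1]))
    return hits


# Invented sample text, standing in for the OCR output of an article.
sample = ("See London, British Library, Harley 2965, fols. 1-10, "
          "and also Harley 863.")
for line in keyword_in_context(sample, "Harley", width=2):
    print(line)
```

Each line printed shows the match with the shelfmark number that follows it, which is the part a plain list of matching files leaves out.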

7. Backing up.
I had a hard drive fail only the other day, but my complex backup system meant that I didn't lose any data. Get at least one large external hard drive and set things backing up when you go to bed. I use BounceBack to do system backups and Allway Sync to do basic synchronisations of data. (BounceBack came free on an external hard drive I bought; it's sometimes worth paying attention to the software that comes bundled with hardware.) Allway Sync lets you choose whether or not things deleted from the source folder get deleted from the backup folder, so I back my external hard drive up twice, once in each way to different hard drives. If I accidentally delete something I can retrieve it from the comprehensive backup, but if I were to lose the whole drive I would get my new copy from the more accurate backup. Of course if I were really paranoid I would have an off-site backup, but I have too much data for an online one.
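The difference between the two kinds of backup can be sketched in Python. This is only an illustration of the idea, not how Allway Sync or BounceBack work internally; the function name sync and the propagate_deletions flag are made up for the example, and it assumes a file needs copying when the backup copy is missing or older than the source:

```python
import shutil
from pathlib import Path


def sync(source, backup, propagate_deletions):
    """Copy new or changed files from `source` into `backup`.

    With propagate_deletions=True the backup becomes an exact mirror:
    files no longer present in the source are removed from it too.
    With propagate_deletions=False nothing is ever removed, so files
    deleted by accident can still be retrieved.
    """
    source, backup = Path(source), Path(backup)
    backup.mkdir(parents=True, exist_ok=True)
    for src in source.rglob("*"):
        if src.is_file():
            dest = backup / src.relative_to(source)
            # Copy when the backup copy is missing or older than the source.
            if not dest.exists() or dest.stat().st_mtime < src.stat().st_mtime:
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dest)
    if propagate_deletions:
        for dest in list(backup.rglob("*")):
            # Remove backup files whose source counterpart has gone.
            if dest.is_file() and not (source / dest.relative_to(backup)).exists():
                dest.unlink()
```

Running sync once with propagate_deletions=False to one drive and once with propagate_deletions=True to another gives the two backups described above: a comprehensive one that keeps everything, and an accurate one that matches the source exactly.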

8. What I'd like to try.
I think I'm going to try taking photos of books and see whether Finereader can OCR them for me. The problem will be camera shake. But ABBYY have a special application for this, Fotoreader, which is what gave me the idea it might be worth trying.

9. Maybe in about 10 years' time.
My graphics tablet has amazing handwriting recognition. I found I could write more or less normally and it would OCR what I had written. So the next thing will be something I can run my handwritten MS notes through which will OCR them and make them searchable. Not possible now I think, but technology might well head that way. So at present I can't search my manuscript notes, which would be a useful thing to be able to do. I have two workarounds for this: the first is Onenote. I switched to Onenote a couple of years ago and now I absolutely depend on it for all my work and home organisation and everything, and will never be able to leave it. It replaces those lever arch files I used to have which organised notes for a particular project. When I find something in a manuscript which I don't need at the moment but think I might want to find again one day I make a note in Onenote. For example the other day I saw an interesting form of quire signature which reminded me of something in a Trinity MS, and because I made a quick note in Onenote I can find it immediately by searching on "quire" and tell you now that it was in an 8th-century Merovingian manuscript in the Vatican. I'm not going to go into the excellences of Onenote now; but I find it very useful for many things, and it's now the core of how I work. My second workaround is old school: I have a little address book, the sort with alphabetical tabs but no actual text printed in it, and I copy interesting letterforms into it. One day I'll work out how to migrate it to my computer, but at the moment I just carry that around. It's useful to consult but easy to leave at home accidentally.

10. If I were a proper geek.
There are instructables and such online for making your own book scanner, from complex ones to more basic but clever designs. It would be a lot quicker, and a bit better for the book, than using a flat-bed scanner or photocopier, and quicker and more accurate than the basic camera method. But I don't have the oomph to make one, and even if I did it would be a bit expensive. You can get a proper professional set-up for about 1500 dollars, plus the cost of a good camera, and I think more and more libraries are investing in them. Then they digitise books which are in demand for undergraduates. I have no idea what the copyright implications of that are. (For myself I'm not digitising anything I'm not allowed to photocopy, so I don't think it's an issue for me.) I saw the scanner at Stanford when I went out, one of the ones which works for Google Books, but it's huge, the size of a room, and we've all seen how patchy its results are.

Of course some parts of this have taken me quite a bit of time, but I am prone to intense but short-lived enthusiasms, and got most of it done while watching TV I was probably going to watch anyway. Now it seems easier to me to scan something in than to file it. And the weird thing is that my pdf library is only just over 9GB in size, i.e. tiny. (My manuscript notes library is 155GB but then it has a lot of images in.)

