Tuesday, June 24, 2003

About a week ago I bought one of those CDs from eBay that have 1800 novels on them. Well, suffice it to say that their definition of 'novel' was slightly different from the mainstream. After culling out all the census reports, CIA sourcebooks, and other crap documents from the past 15 years, and deleting the multitude of mathematical proofs, listings of pi, powers of 1 through 10, and square/cube roots of 1-100, I was left with about 1200 'novels'. Of those, I kept only the 250 or so that were to my taste. (By the way, ALL of these documents were from the Gutenberg project, and were available for free. I wonder if the fellow that sold it to me paid them their 20%?)

So, after distilling 1800 down to 250, I was left with 250 text files ranging from 30 KB to 3 MB. I further sorted them into folders by author, and for those without author attribution, sorted by subject or just into 'misc'.

Then I wrote a short shell script (Linux) to work some conversion magic on them. Here it is:

for h in */ ; do
    cd "$h"
    for j in *.txt ; do
        k=${j%.txt}                  # folder named after the txt file, minus the suffix
        mkdir "$k"
        split -C 150k "$j" "$k/$k-"
        rm "$j"
        cd "$k"
        for i in * ; do
            sed "s/Untitled/$i/g" /pre.htm > /pre1.htm
            cat /pre1.htm "$i" /post.htm > "$i.htm"
            rm "$i"
            echo "finished with $i"
        done
        cd ..
    done
    echo "done in $h"
    cd ..
done

To use this, run it from a folder containing other folders, each of which has one or several .txt files in it. The script will not follow subdirectories; it only works one level deep. It handles .txt files only, by the way. You also need two HTML files. One contains the top part of a web page (just the opening HTML tag, the HEAD open and close tags, the BODY open tag, and the PRE open tag); call it pre.htm. The other is the bottom HTML stuff, called post.htm, which should just be a closing PRE tag, followed by the closing BODY and HTML tags. I put these in the root, but you can edit the script and put them wherever you wish.
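For reference, the two wrapper files might be built like this. One assumption here: since the script's sed swaps the word "Untitled" for the chunk's filename, pre.htm must contain "Untitled" somewhere (I've put it in a TITLE tag). I write them to the current directory for illustration; adjust the paths if you keep them in the root like I did.

```shell
# Sketch of the two wrapper files (assumption: "Untitled" sits in a TITLE
# tag so the script's sed has something to replace).
cat > pre.htm <<'EOF'
<HTML>
<HEAD><TITLE>Untitled</TITLE></HEAD>
<BODY>
<PRE>
EOF

cat > post.htm <<'EOF'
</PRE>
</BODY>
</HTML>
EOF
```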

The script lists all of the directories in the current directory (it will not work on txt files sitting loose in the current folder - move them into a misc folder as needed).
For each directory, it moves into it, then lists the txt files.
For each txt file, it creates a directory with the same name as the txt file (without the .txt suffix).
Then it runs the 'split' utility to chop the txt file into several smaller files, with each piece capped at 150k (just a value I picked). split writes the resulting pieces straight into the new folder named for the text file, naming them OLDFILENAME-aa and so on, where the suffix increments through aa, ab, ac, etc.
Then it deletes the original txt file and changes into the folder named for it.
For each file section, it takes the filename and edits the pre.htm file, saving the edited copy as pre1.htm. Then it concatenates pre1.htm, the section itself, and post.htm, saving the result as an .htm file. This converts each txt section into an htm file with a custom header containing the filename. The source section with the -aa suffix is then deleted.
Then it backs up a directory level and goes on to the next txt file.
Once all the txt files in that folder are done, it backs up again and moves on to the next folder.
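If you want to see the per-file steps in isolation, here is a minimal, self-contained walk-through on a single made-up file (the names, the tiny 1k split size, and the one-line wrapper files are all illustrative, not my actual data):

```shell
# Create a throwaway input file and minimal wrapper files (illustrative only).
printf 'line one\nline two\n' > sample.txt
printf '<HTML><HEAD><TITLE>Untitled</TITLE></HEAD><BODY><PRE>\n' > pre.htm
printf '</PRE></BODY></HTML>\n' > post.htm

j=sample.txt
k=${j%.txt}                   # "sample" - directory named after the file
mkdir "$k"
split -C 1k "$j" "$k/$k-"     # chunks land as sample/sample-aa, sample/sample-ab, ...
for i in "$k"/* ; do
    name=$(basename "$i")                        # e.g. sample-aa
    sed "s/Untitled/$name/g" pre.htm > pre1.htm  # bake the chunk name into the header
    cat pre1.htm "$i" post.htm > "$i.htm"        # wrap the chunk in HTML
    rm "$i"                                      # drop the bare-text chunk
done
```

Since the sample file is tiny, split produces a single chunk, and you end up with sample/sample-aa.htm carrying its own name in the TITLE.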

All in all, it worked very nicely, and left me with a structure like this (names are just examples):

    AuthorName/
        BookTitle/
            BookTitle-aa.htm
            BookTitle-ab.htm
            BookTitle-ac.htm

etc. So each large text file was split into several smaller htm files in their own folder, named after the text file.

Hopefully some of this long and rambling post will be useful to someone!