To get the stripped core content into Drupal, I wrote a PHP script that uses the Drupal API. It is invoked using the drush command, as follows:
drush -r $D7 -l www.iac.org scr inhale.php
Here's the script itself:
<?php
//
// DJM, 2012-02-08
//
// PHP script that takes a list of filenames as input, and converts those files
// into Legacy nodes on the www.iac.org web site
//
// The first line of each file is assumed to be the contents of the HTML <title>
// element, and is stored in the node's title field.
// The remaining lines are stored in the node's body.
//
// The filename itself is the legacy URL with pound signs (#) in place of slashes,
// and at signs (@) in place of spaces. This script restores the slashes and spaces,
// and stores the result in the field_old_url field.
//
// The field_status field is always set to 'Not Started'

$home_dir = "/home/djmolny/legacy-import/";

// Read one filename per line from standard input until input runs out
while (( $filename = readline("")) != FALSE) {
  print 'filename="' . $filename . '"' . "\n";

  $f = fopen($home_dir . $filename, 'r');
  if ($f == FALSE) {
    exit(1);
  }

  $title = trim(fgets($f));
  print 'title="' . $title . "\"\n";

  $body = fread($f, 1024*1024); // 1MB limit is arbitrary, but should suffice

  // Decode the filename back into the legacy URL
  $old_url = str_replace("#", "/", str_replace("@", " ", $filename));
  $old_url = str_replace("public//", "http://www.iac.org/", $old_url);
  $old_url = str_replace("members//", "http://members.iac.org/", $old_url);
  print 'old_url="' . $old_url . "\"\n";

  // Build and save the legacy_page node
  $node = new stdClass();
  $node->type = 'legacy_page';
  node_object_prepare($node);

  $node->title = $title;
  $node->language = LANGUAGE_NONE;

  $node->body[$node->language][0]['value'] = $body;
  $node->body[$node->language][0]['summary'] = text_summary($body);
  $node->body[$node->language][0]['format'] = 'full_html';

  $node->field_old_url[$node->language][0]['value'] = $old_url;
  $node->field_status[$node->language][0]['value'] = 'Not Started';

  node_save($node);

  print("Done!\n\n\n"); // !!!
}
?>
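To make the filename convention concrete, here is a small standalone sketch of just the decoding step, runnable without Drupal. The function name legacy_filename_to_url() and the sample filename are hypothetical, chosen only for illustration; the replacements themselves mirror the ones in the script.

```php
<?php
// Hypothetical helper illustrating the filename convention; the name
// legacy_filename_to_url() and the sample filename below are assumptions.
function legacy_filename_to_url($filename) {
  // Undo the encoding: '@' stands for a space, '#' for a slash.
  $url = str_replace('#', '/', str_replace('@', ' ', $filename));
  // "public//" and "members//" identify which site the page came from.
  $url = str_replace('public//', 'http://www.iac.org/', $url);
  $url = str_replace('members//', 'http://members.iac.org/', $url);
  return $url;
}

// Prints: http://www.iac.org/about us.html
print legacy_filename_to_url('public##about@us.html') . "\n";
```

Note that the space is restored literally rather than as %20, which matches what the import script stores in field_old_url.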
Note: Drupal rejected numerous files because they contained special symbols, such as "½ loop", "360º roll", or "Fédération", that were not stored as valid UTF-8. The characters themselves exist in Unicode, so the files were presumably saved in an older single-byte encoding such as Latin-1. I edited each of these files manually, replacing the symbols with their HTML entities (&frac12;, &deg;, and &eacute;, respectively). I thought about scripting this process, but since the import is a one-time exercise I decided it wasn't worth the effort. However, I'm documenting the problem in case it crops up somewhere down the road.
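Should the problem come up again, the manual fix could probably be scripted along the following lines. This is only a sketch under assumptions: the function name legacy_to_entities() is hypothetical, the source encoding is guessed to be Windows-1252 (ISO-8859-1 is the other likely candidate), and PHP's mbstring extension must be enabled. It emits numeric entities such as &#233; rather than the named entities I used by hand; browsers render both the same way.

```php
<?php
// Hypothetical helper, not part of the original import script.
// Assumes the rejected files were saved as Windows-1252 (a guess) and
// that the mbstring extension is available.
function legacy_to_entities($raw, $source_encoding = 'Windows-1252') {
  // Input that is already valid UTF-8 is used as-is; anything else is
  // converted from the assumed legacy encoding.
  $utf8 = mb_check_encoding($raw, 'UTF-8')
    ? $raw
    : mb_convert_encoding($raw, 'UTF-8', $source_encoding);

  // Replace every non-ASCII code point with a decimal numeric entity.
  // ASCII, including existing HTML markup, passes through untouched.
  $convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
  return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}
```

The result would be applied to $title and $body before they are assigned to the node. Converting to entities instead of plain UTF-8 keeps the stored text pure ASCII, which sidesteps any further encoding questions at the cost of slightly less readable source.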