Stellent, like Drupal, automatically wraps each page's core content with a lot of structure that is the same page-to-page: HTML <head> and <body> tags, headers, menus, footers, logos, etc. So I whipped up a couple of shell scripts to strip that stuff off of each file, leaving just the core content behind. Here's an English description of the edits performed:
- Files on the public site (www.iac.org):
  - Extract the contents of the <title> tag and save them to the new stripped file
  - Delete lines 1 through 420
  - Delete the line that contains the string </ul></td></tr></table>, and all subsequent lines
  - Exception for the files in the www.iac.org forms directory:
    - Delete lines 1 through 10, and the final two lines
- Files on the members-only site (members.iac.org):
  - Extract the contents of the <title> tag and save them to the new stripped file
  - Delete lines 1 through 393
  - Delete the line that contains the string </ul></td></tr></table>, and all subsequent lines
  - Note: The termination string above may not be on a line by itself, so a newline must be inserted just before it; otherwise any core content sharing that line would be deleted along with it
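To make the edits listed above concrete, here is a minimal sketch that applies them to a tiny made-up page. The sample HTML, the 3-line header (standing in for the real 420 or 393 lines), and the function names page_title, body_public, and body_members are all invented for illustration. The members variant assumes GNU sed, which treats \n in the replacement as a newline; other sed implementations may not.

```shell
#!/bin/bash
# Illustration only: the sample page and its 3-line header are invented.
# The real pages needed 420 (public) or 393 (members) leading lines removed.

# Terminator string, with slashes escaped for use inside a sed pattern
endstr='<\/ul><\/td><\/tr><\/table>'
# Unlikely-to-occur marker used by the members variant
delim='!!!!!!!!!!----------!!!!!!!!!!'

# Print the text of the first line containing <title>, minus the tags.
page_title() {
  grep '<title>' < "$1" | sed 1q | sed 's/<title>//' | sed 's/<\/title>//'
}

# Public-site style: drop the first $2 lines, then delete the terminator
# line and everything after it.
body_public() {
  sed "1,${2}d" < "$1" | sed "/$endstr/,\$d"
}

# Members-site style: first split the line just before the terminator,
# so content sharing a line with it survives. Assumes GNU sed for \n.
body_members() {
  sed "1,${2}d" < "$1" | sed "s/$endstr/\n$delim/" | sed "/$delim/,\$d"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
<html><head>
<title>Contest Rules</title>
</head><body>
<p>First paragraph.</p>
<p>Last paragraph.</p></ul></td></tr></table><div>menu</div>
<div>footer</div>
</body></html>
EOF

title=$(page_title "$sample")
public_body=$(body_public "$sample" 3)
members_body=$(body_members "$sample" 3)
rm -f "$sample"

echo "title:   $title"
echo "public:  $public_body"
echo "members: $members_body"
```

On this sample the public-style edit loses the "Last paragraph" line because it shares a line with the terminator, while the members-style edit keeps it. That is the situation the note above describes.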
Here's the shell script that accomplished this:
#!/bin/bash
# DJM, 2012-02-08
# Trim all files and copy results to local dirs

# Where the old files are
legacy_dir=/var/www/old-site

# Define a unique delimiter string
delim='!!!!!!!!!!----------!!!!!!!!!!'

# The string that marks the end of the HTML content that concerns us
endstr='<\/ul><\/td><\/tr><\/table>'

# Step 0: Flush any old results
rm -fr public; mkdir public

# Step 1: public
while read f
do
  echo "$f"
  filename="$legacy_dir/public/$f"
  target=`echo "$f" | sed 's/\//#/g' | sed 's/ /@/g' | sed 's/^\.//' | sed 's/^/public\//'`
  grep '<title>' < "$filename" | sed 1q | sed 's/<title>//' | sed 's/<\/title>//' > "$target"
  sed '1,420d' < "$filename" | sed "/$endstr/,\$d" >> "$target"
done < public-file-list

echo;echo;echo

# Step 2: public/forms
while read f
do
  echo "$f"
  filename="$legacy_dir/public/$f"
  target=`echo "$f" | sed 's/\//#/g' | sed 's/ /@/g' | sed 's/^\.//' | sed 's/^/public\//'`
  grep '<title>' < "$filename" | sed 1q | sed 's/<title>//' | sed 's/<\/title>//' > "$target"
  sed '1,10d' < "$filename" | grep -v '^<\/body>' | grep -v '^<\/html>' >> "$target"
done < public-file-list-forms

echo;echo;echo

# Step 3: members
while read f
do
  echo "$f"
  filename="$legacy_dir/members/$f"
  target=`echo "$f" | sed 's/\//#/g' | sed 's/ /@/g' | sed 's/^\.//' | sed 's/^/members\//'`
  grep '<title>' < "$filename" | sed 1q | sed 's/<title>//' | sed 's/<\/title>//' > "$target"
  sed '1,393d' < "$filename" | sed "s/$endstr/\n$delim/" | sed "/$delim/,\$d" >> "$target"
done < members-file-list
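The target= pipeline in the script flattens each source path into a single file name, so every stripped page lands directly in one output directory: slashes become #, spaces become @, and the leading dot is dropped. The sketch below isolates that step. The function names flatten and unflatten and the example path are hypothetical, and unflatten is not part of the script above; it is only a guess at how the mapping could be reversed later.

```shell
#!/bin/bash
# Sketch of the path-flattening step; function names and the example
# path are invented for illustration.

# $1 = path as it appears in the file list, $2 = output directory
flatten() {
  echo "$1" | sed 's/\//#/g' | sed 's/ /@/g' | sed 's/^\.//' | sed "s/^/$2\//"
}

# Hypothetical inverse (not in the original script). It cannot tell a
# substituted # or @ from one that was already in the original name.
unflatten() {
  basename "$1" | sed 's/#/\//g' | sed 's/@/ /g' | sed 's/^/./'
}

flat=$(flatten "./news/2011 recap.htm" public)
orig=$(unflatten "$flat")

echo "$flat"
echo "$orig"
```

Because the substitution characters are not escaped, the round trip is only reliable when the original file names contain no # or @ characters.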