Stripping the Fluff

Stellent, like Drupal, automatically wraps each page's core content in structure that is identical from page to page: HTML <head> and <body> tags, headers, menus, footers, logos, etc. So I whipped up a couple of shell scripts to strip that wrapper off of each file, leaving just the core content behind. Here's an English description of the edits performed:

  • Files on the public site (www.iac.org):
    • Extract the contents of the <title> tag and save them to the new stripped file
    • Delete lines 1 through 420
    • Delete the line that contains the string </ul></td></tr></table>, and all subsequent lines
  • Exception for the files in the www.iac.org forms directory:
    • Delete lines 1 through 10, and the final two lines (the closing </body> and </html> tags)
  • Files on the members-only site (members.iac.org):
    • Extract the contents of the <title> tag and save them to the new stripped file
    • Delete lines 1 through 393
    • Delete the line that contains the string </ul></td></tr></table>, and all subsequent lines
      • Note: The termination string above may not be on a line by itself, so a newline must be inserted just before it; otherwise any content sharing its line would be deleted too
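
To see why the members-only files need that extra newline, here is a minimal sketch on made-up input. The sample HTML and the short delimiter are assumptions for illustration only, and the substitution relies on GNU sed, which treats \n in the replacement as a newline.

```shell
#!/bin/bash
# Hypothetical sample: the terminator shares a line with real content.
endstr='<\/ul><\/td><\/tr><\/table>'
delim='@@@@@CUT@@@@@'   # stand-in for a unique delimiter string

sample='<p>first</p>
<p>last</p></ul></td></tr></table>
<div>footer</div>'

# Naive: delete from the matching line onward. "<p>last</p>" is lost too.
naive=$(printf '%s\n' "$sample" | sed "/$endstr/,\$d")

# Safer: replace the terminator with a newline plus the delimiter, so the
# delimiter sits on a line of its own, then delete from that line onward.
safe=$(printf '%s\n' "$sample" | sed "s/$endstr/\n$delim/" | sed "/$delim/,\$d")

printf 'naive:\n%s\n\nsafe:\n%s\n' "$naive" "$safe"
```

The naive version keeps only the first paragraph, while the two-step version keeps both paragraphs and still drops the footer.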

Here's the shell script that accomplished this:

#!/bin/bash
# DJM, 2012-02-08
# Trim all files and copy results to local dirs

# Where the old files are
legacy_dir=/var/www/old-site

# Define a unique delimiter string
delim='!!!!!!!!!!----------!!!!!!!!!!'

# The string that marks the end of the HTML content that concerns us
endstr='<\/ul><\/td><\/tr><\/table>'

# Step 0: Flush any old results
# (Step 3 writes into members/, so that directory must exist as well)
rm -fr public members; mkdir public members

# Step 1: public
while IFS= read -r f
do
  echo "$f"
  filename="$legacy_dir/public/$f"

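  # Flatten the path into one file name: "/" becomes "#", " " becomes "@",
  # the leading "." is dropped, and the output directory is prepended
  target=`echo "$f" | sed 's/\//#/g' | sed 's/ /@/g' | sed 's/^\.//' | sed 's/^/public\//'`
  # The first line containing <title>, minus the tags, starts the new file
  grep '<title>' < "$filename" | sed 1q | sed 's/<title>//' | sed 's/<\/title>//' > "$target"

  # Drop the 420-line header, then everything from the terminator line on
  sed '1,420d' < "$filename" | sed "/$endstr/,\$d" >> "$target"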
done < public-file-list

echo;echo;echo

# Step 2: public/forms
while IFS= read -r f
do
  echo "$f"
  filename="$legacy_dir/public/$f"

  target=`echo "$f" | sed 's/\//#/g' | sed 's/ /@/g' | sed 's/^\.//' | sed 's/^/public\//'`
  grep '<title>' < "$filename" | sed 1q | sed 's/<title>//' | sed 's/<\/title>//' > "$target"

  # Drop the 10-line header and the closing </body> and </html> lines
  sed '1,10d' < "$filename" | grep -v '^</body>' | grep -v '^</html>' >> "$target"
done < public-file-list-forms

echo;echo;echo

# Step 3: members
while IFS= read -r f
do
  echo "$f"
  filename="$legacy_dir/members/$f"

  target=`echo "$f" | sed 's/\//#/g' | sed 's/ /@/g' | sed 's/^\.//' | sed 's/^/members\//'`
  grep '<title>' < "$filename" | sed 1q | sed 's/<title>//' | sed 's/<\/title>//' > "$target"

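  # Drop the 393-line header, put the delimiter on its own line in place of
  # the terminator (\n in the replacement needs GNU sed), then cut from there on
  sed '1,393d' < "$filename" | sed "s/$endstr/\n$delim/" | sed "/$delim/,\$d" >> "$target"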
done < members-file-list
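
The target names built by the sed pipelines above flatten each relative path into a single file name, so every stripped file lands directly in one output directory. As a rough sketch of that mapping, here it is pulled out into a function. The name flatten_name and the sample path are hypothetical and not part of the original script.

```shell
#!/bin/bash
# flatten_name is a hypothetical helper, not part of the original script.
# It maps a relative path to a flat file name inside an output directory:
#   "/" becomes "#", " " becomes "@", and one leading "." is dropped.
flatten_name() {
  local path=$1 outdir=$2
  printf '%s\n' "$path" |
    sed -e 's/\//#/g' -e 's/ /@/g' -e 's/^\.//' -e "s/^/$outdir\//"
}

# Made-up example path, in the "./dir/file" form a file list might contain
flat=$(flatten_name './contests/2011 results.htm' public)
echo "$flat"   # prints public/#contests#2011@results.htm
```

Because "#" and "@" stand in for "/" and " ", the original path can still be read back from the flat name, as long as no legacy file name already contains those two characters.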