univac_history

This list-mail blog has lived several lives, starting as a simple email list and then moving to a Google Groups list way back when Groups was in beta. The fact that Google Groups was free was terrific. Not to mention the fact that Google Groups content contributes directly to Google’s dynamic corpus, literally driving Google’s search engine at its core…

However, over the years, post after post, this list grew. Google grew. And Google didn’t scale very well along with us. Tons of bugs piled up, and Google engineering never got around to fixing any of our issues; we became one tiny amoeba in the Google universe.

And now here it is, the Spectre Event Horizon Group on WordPress! Here are some geeky reasons why:

  • Open Source Blog Software. Unlike Google’s groups, we can keep the actual software which runs this site- forever! On our own physical servers, even…
  • Open Database Format. Unlike Google’s groups, all WordPress blogs can be EXPORTED as well as imported, in a relatively sane and well documented XML format. If this blog moves somewhere else one day, our XML backups can be easily parsed!
  • Excellent multi-user collaborative publishing tools.
  • Killer editing tools, and easy to add images, etc…
  • Free! (space/usage upgrades at cost if we want/need them)

So, with that stated, it was still a tremendous effort to port the Spectre content from Google Groups to WordPress. After much messy hacking, here’s how it was done (with some notes about which parts failed). This is not necessarily the order it all occurred in, but it’s the process, and it is posted here in the hope that others find it useful:

1) Google Groups Export

We needed to get the data *out* of Google in some program-readable format. After *many* unsuccessful attempts to screen-scrape the content from the old site, we eventually realized that all posts to the Google Group were from a single Gmail account. We set up the account to be accessible via IMAP, and downloaded the messages using an Apple OS X machine and Mail.app. (!!! In retrospect, it would have been better to use a program which uses standard UNIX .mbox files; the character encoding of Apple’s mail files was a real problem!) Parsing the Apple .mbox files became incredibly problematic, and inconsistent from message to message.

Eventually, I sucked it up and saved each message as a plain text file (which defaulted to UTF-8, whew!), and continued with the parsing below.
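For anyone who can keep the messages in standard RFC 822 form (which is what the .mbox remark above is about), here is a minimal sketch of letting Python’s standard email package do the header splitting and the quoted-printable decoding. This is Python 3, not the Python 2 script used for the actual migration; the sample message, its address, and the read_message helper are made up for illustration, and it assumes single-part plain-text messages.

```python
# Python 3 sketch. RAW is an invented sample, not a real Spectre message.
from email import message_from_string
from email.utils import parseaddr

RAW = (
    "From: Example Person <person@example.org>\n"
    "Subject: [spectre] hello\n"
    "Date: Sun, 04 Nov 2007 21:11:02 -0000\n"
    "Content-Type: text/plain; charset=\"windows-1252\"\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "It=92s a test\n"
)

def read_message(raw):
    """Return (author_address, subject, decoded_body) for a single-part message."""
    msg = message_from_string(raw)
    # parseaddr splits 'Name <addr>' into its two parts
    _name, address = parseaddr(msg["From"])
    # decode=True undoes the quoted-printable (or base64) transfer encoding
    payload = msg.get_payload(decode=True)
    # fall back to UTF-8 when the message does not declare a charset
    charset = msg.get_content_charset() or "utf-8"
    body = payload.decode(charset, errors="replace")
    return address, msg["Subject"], body

address, subject, body = read_message(RAW)
```

Here the '=92' sequence comes back as a proper right single quote, which is the kind of character that caused trouble in the hand-rolled parsing.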

2) Parsing the email messages (as text files)!!!

So, I used Python, which is quick and very precise for text-processing tasks. Here’s the (very hackish and dirty) source code; again, I’m just posting this as it may help someone else doing a more refined job:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#

"""
Copyright (c) 2008, Isaac (.ike) Levy, For Spectre
All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list
of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.
Neither the name of the <ORGANIZATION> nor the names of its contributors may be
used to endorse or promote products derived from this software without specific
prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
"""

########
# BUGS #
########

#[ ] date stamp problem

#[x] check for completeness (was rtf issue for 3 image messages)

#[?] url wrap problem ?!

#[x] mime/urlencoded ' and " symbols, ('=92' etc...)

#[x] emails, crediting

#[x] !! contact emails within body
  # Cows as compass as reference?

#########

# NEXT STEPS
# - import GM list
# - export GM blog xml for diffs

# A quick script to convert Gmail messages to WordPress files
# Created to aid in converting a Google Group to a WordPress blog

# for each message:
#
# - slice out date
#  + re-format to WordPress date format?
# - slice out author (email address from field)
#  + output author at top of content
# - slice out CONTENT
#  - turn url's into links
#

import os, glob
import re
import codecs
import quopri

#import time
from time import mktime, gmtime, strftime
from datetime import datetime
from calendar import timegm

#dirPath = 'Starred.imapmbox/Messages/'
#dirPath = 'Starred.imapmbox.small/Messages/'
dirPath = 'Starred.converted/'

# defines path to file
##filename = 'Starred.imapmbox/Messages/1062.emlx' # digest
##filename = 'Starred.imapmbox/Messages/1523.emlx' # typical
# TODO make this a directory path        

################################################################################
# ikenote example import file
################################################################################
'''<?xml version="1.0" encoding="UTF-8"?>
<!-- Underwritten by the Mechanical Proofreaders Union of the 63rd Chamber of Sheygyets-Libling, -->
<!-- this is a dotike processed conversion dump of WordPress to Spectre, Sep 23, 2008 -->

<!-- This is a WordPress eXtended RSS file generated by WordPress as an export of your blog. -->
<!-- It contains information about your blog's posts, comments, and categories. -->
<!-- You may use this file to transfer that content from one site to another. -->
<!-- This file is not intended to serve as a complete backup of your blog. -->

<!-- To import this information into a WordPress blog follow these steps. -->
<!-- 1. Log into that blog as an administrator. -->
<!-- 2. Go to Manage: Import in the blog's admin panels. -->
<!-- 3. Choose "WordPress" from the list. -->
<!-- 4. Upload this file using the form provided on that page. -->
<!-- 5. You will first be asked to map the authors in this export file to users -->
<!--    on the blog.  For each author, you may choose to map to an -->
<!--    existing user on the blog or to create a new user -->
<!-- 6. WordPress will then import each of the posts, comments, and categories -->
<!--    contained in this file into your blog -->

<!-- generator="WordPress/MU" created="2008-09-20 18:44"-->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:wp="http://wordpress.org/export/1.0/"
>

<channel>
	<language>en</language>
	<wp:wxr_version>1.0</wp:wxr_version>
	<wp:category><wp:category_nicename>g-archive</wp:category_nicename><wp:category_parent>g-archive</wp:category_parent><wp:cat_name><![CDATA[g-archive]]></wp:cat_name></wp:category>

		<item>
<title>ike import Four</title>
<dc:creator><![CDATA[tutle]]></dc:creator>

<description>Added Description Four</description>
<content:encoded><![CDATA[IKENOTE CONTENT AREA
More Content

Double return carrage and more content

<a href="http://slashdot.org/">http://slashdot.org/</a>

end of post]]></content:encoded>
<wp:post_date>2010-02-01 00:00:00</wp:post_date>
<pubDate>Mon, 22 Sep 2008 17:16:25 +0000</pubDate>
<wp:status>draft</wp:status>
<wp:post_type>post</wp:post_type>
<wp:postmeta>
</wp:postmeta>
	</item>
<item>

'''
################################################################################

sec0 = '''<?xml version="1.0" encoding="UTF-8"?>

<!-- generator="WordPress/MU" created="2008-09-20 18:44"-->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:wp="http://wordpress.org/export/1.0/"
>

<channel>
	<language>en</language>
	<wp:wxr_version>1.0</wp:wxr_version>
	<wp:category><wp:category_nicename>g-archive</wp:category_nicename><wp:category_parent></wp:category_parent><wp:cat_name><![CDATA[g-archive]]></wp:cat_name></wp:category>
'''

sec0b = "<wp:category><wp:category_nicename>g-digest</wp:category_nicename><wp:category_parent></wp:category_parent><wp:cat_name><![CDATA[g-digest]]></wp:cat_name></wp:category>"

sec1 = '''
		<item>
<title>'''

sec2 = '''</title>
<dc:creator><![CDATA[tutle]]></dc:creator>

		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[g-archive]]></category>

		<category domain="category" nicename="g-archive"><![CDATA[g-archive]]></category>

<description>'''

sec3 = '''</description>
<content:encoded><![CDATA['''

sec4 = ''']]></content:encoded>
<wp:post_date>'''

sec5 = '''</wp:post_date>
<pubDate>'''

sec6 = '''</pubDate>
<wp:status>publish</wp:status>
<wp:post_type>post</wp:post_type>
<wp:postmeta>
</wp:postmeta>
	</item>
<item>'''

originalAuthor = '''From the archive, originally posted by: '''

# ikenote: the big processing loops for each message
def procFile(x):

  postSubject =''
  postBody = u''
  postDate =''
  pubDate = ''
  postAuthor = ''

  # defines each file
  # TODO make this a for loop to iterate over files in a given directory
  #ikenote works-> ###
  post = open(x, "r")
  #ikenote kindof works -> ##post = codecs.open(x, "r", "utf-8")
  #codecs.open(filename, mode[, encoding[, errors[, buffering]]])
  #print post.read()
  #print '--'

  halves = post.read().split('\n\n', 1)
  header = halves[0]
  #body = unicode( halves[1], "utf-8" )
  body = halves[1]

  postBody = ''

  # ikenote: sequentially process each line
  for line in header.split('\n'):
    #print 'Debug:' + str(line)

    if line.startswith('Subject: '):
      postSubject = line.split(' ', 1)[1]
      #postSubject = postSubject.lower() # ikenote causes strange bug- missing leading 't'?
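      # ikenote: slice the prefix off; lstrip() strips a *set of characters*,
      # not a prefix, which is what ate leading letters like 't'
      if postSubject.startswith('[spectre] '):
          postSubject = postSubject[len('[spectre] '):]
      if postSubject.startswith('Re: [spectre] '):
          postSubject = 'Re: ' + postSubject[len('Re: [spectre] '):]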
      postSubject = postSubject.upper()
      #print postSubject

    elif line.startswith('Date: '):
      postDate = line
      #print postDate
      # msg formatting:
      # Date: Sun, 04 Nov 2007 21:11:02 -0000
      # to
      # <pubDate>Wed, 16 Jul 2008 08:55:01 +0000</pubDate>
      # (thank goodness, WP forgives day abbreviation for import!)
      # and
      # <wp:post_date>2006-02-27 00:14:00</wp:post_date>
      ##print postDate[18:22] + '-' + postDate[14:17] + '-' + postDate[11:13]
      #print postDate.split(' ')
      #print postDate
      year = '1900'
      month = '01'
      day = '01'
      time = '00:00:00'

      ## ikenote: correction, date format is as follows now:
      # "Date: May 15, 2007 1:34:17 PM EDT"
      # tokenized:
      """
      Debugdate:Date:   0
      Debugdate:March   1
      Debugdate:5,      2
      Debugdate:2006    3
      Debugdate:8:45:07 4
      Debugdate:PM      5
      Debugdate:EST     6
      """

      splat = postDate.split(' ')

      #for x in splat:
      #print "Debugdate:" + str(x)
      if splat[1] == 'January':
        month = 1
      elif splat[1] == 'February':
        month = 2
      elif splat[1] == 'March':
        month = 3
      elif splat[1] == 'April':
        month = 4
      elif splat[1] == 'May':
        month = 5
      elif splat[1] == 'June':
        month = 6
      elif splat[1] == 'July':
        month = 7
      elif splat[1] == 'August':
        month = 8
      elif splat[1] == 'September':
        month = 9
      elif splat[1] == 'October':
        month = 10
      elif splat[1] == 'November':
        month = 11
      elif splat[1] == 'December':
        month = 12

      day = splat[2].strip(',')

      year = splat[3]

      #elif len(x) == 8 and x[2] == ':':

      time = splat[4]
      #print "Debugdate2 am-pm:" + str(time)
      splats = splat[4].split(':')
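      if splat[5] == 'PM' and int(splats[0]) < 12:
        hour = int(splats[0]) + 12
        time = str(hour) + ':' + str(splats[1]) + ':' + str(splats[2])
      elif splat[5] == 'AM' and int(splats[0]) == 12:
        # 12:xx:xx AM is just after midnight, i.e. hour 0
        hour = 0
        time = '0:' + str(splats[1]) + ':' + str(splats[2])
      else:
        hour = splats[0]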

      minute = splats[1]
      second = splats[2]

      postDate = str(year) + '-' + str(month) + '-' + str(day) + ' ' + str(time)
      # format pub date is different? yuck. We want:
      # <pubDate>Wed, 16 Jul 2008 08:55:01 +0000</pubDate>
      #
      # postDate looks like '2005-9-28, 3:42:10'

      birthday = (int(year), int(month), int(day), int(hour), int(minute), int(second))
      bdaytm = timegm(birthday)
      ##print "Debugdate-bday:" + str(bdaytm)
      ##print "Debugdate1:" + str(strftime("%a, %d %b %Y %H:%M:%S +0000", gmtime(bdaytm)))
      ##print "Debugdate2:" + str(strftime("%a, %d %b %Y %H:%M:%S +0000", gmtime()))
      pubDate = str(datetime.strptime(postDate, "%Y-%m-%d %H:%M:%S"))
      ##print "Debugdate3:" + str(pubDate)

      ##pubDate = postDate
      #print postDate

      ## ikenote - sloppy, but just re-declare new date objects as necessary
      pubDate = str(strftime("%a, %d %b %Y %H:%M:%S +0000", gmtime(bdaytm)))

    # ikenote: this is broken!!!!
    # It just uses one author for all posts- text attribution in body.
    elif line.startswith('From:'):
      #print 'Debug:' + str(line)
      line = str(line.split(' ', 1)[1])
      #line = line.strip('<>')
      postAuthor = line.replace('@', ' [at] ').replace('.', ' [dot] ').replace('<', '{ ').replace('>', ' }')
      #print postAuthor
    #else:
    #  postAuthor = str(line) #'spectre'

  if postSubject != '':
    if postSubject.lower().find('digest') >= 0:
      for line in body.split('\n'):
        if line.startswith('<?xml version="1.0" encoding="UTF-8"?>'):
          break

        elif line.startswith('=3D=3D'):
          line = '=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D'

        elif line.startswith('http://'):
          line = '<a href="' + line +'">' + line + '</a>'
        elif line.startswith('https://'):
          line = '<a href="' + line +'">' + line + '</a>'

        postBody = postBody + line + '\n'

    else:
      for line in body.split('\n'):

        if line.startswith('--~--~---------~--~----~------------~---') or line.startswith('–~–~—'):
          break

        elif line.startswith('=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D='):
          line = '=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D'
          #break

        if line.startswith('<?xml version="1.0" encoding="UTF-8"?>'):
          break

        elif line.startswith('http://'):
          line = '<a href="' + line +'">' + line + '</a>'
        elif line.startswith('https://'):
          line = '<a href="' + line +'">' + line + '</a>'

        #print line

        postBody = postBody + line + '\n'

  #print post.readline()

  #print '--'
  #print post.tell()
  #print post.closed

  #ikenote: experimental scrub email addresses from body
  email_pattern = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")
  emails = re.findall(email_pattern, postBody)
  #print 'Debug:' + str(emails)
  for eam in emails:
    munge = eam.replace('@', ' [at] ').replace('.', ' [dot] ').replace('<', '{ ').replace('>', ' }')
    #print 'Debug:Munge:' + str(munge)
    postBody = postBody.replace(eam, munge)

  #postBody = quopri.decodestring(str(postBody))

  # ikenote: the big print line
  print '<!-- ikenote="filename = ' + str(infile) + '"-->'
  print sec1 + postSubject + sec2 + postSubject + sec3 + originalAuthor + postAuthor + '\n\n' + postBody + sec4 + postDate + sec5 + pubDate + sec6

  #for x in email.iterators.body_line_iterator(mmsg, decode=True):
    #print x
    #print '--'

#for infile in glob.glob( os.path.join(dirPath, '*.emlx') ):
for infile in glob.glob( os.path.join(dirPath, '*.txt') ):
  #print "the file is: " + infile
  print sec0
  procFile(infile)

With that, when the code above is run from a shell as a Python script, it just prints everything to stdout (which was easier for debugging and reading the XML output), so the final output can be redirected to a file, like so:

$ ./gmail_wordpress.py >> ./export.xml
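The date juggling in the script above can be done far more compactly. As a rough sketch only (Python 3, not the Python 2 used above, and convert_date is a made-up helper name): datetime.strptime can read the "May 15, 2007 1:34:17 PM EDT" style directly, including the 12 AM/PM edge cases. Like the script, this sketch drops the zone name and keeps the wall-clock time as-is while labeling it +0000, and it assumes English month names in the running locale.

```python
# Python 3 sketch of the Date: header conversion.
from datetime import datetime

def convert_date(header_line):
    """Turn 'Date: May 15, 2007 1:34:17 PM EDT' into (wp:post_date, pubDate) strings."""
    # Drop the leading 'Date:' token
    value = header_line.split(" ", 1)[1]
    # Drop the trailing zone name (EDT, EST, ...); no zone conversion is attempted
    stamp = value.rsplit(" ", 1)[0]
    # %I with %p handles the 12-hour clock, including 12 AM and 12 PM
    parsed = datetime.strptime(stamp, "%B %d, %Y %I:%M:%S %p")
    post_date = parsed.strftime("%Y-%m-%d %H:%M:%S")
    pub_date = parsed.strftime("%a, %d %b %Y %H:%M:%S +0000")
    return post_date, pub_date

post_date, pub_date = convert_date("Date: May 15, 2007 1:34:17 PM EDT")
midnight_date, _ = convert_date("Date: March 5, 2006 12:45:07 AM EST")
```

A side effect is that the month and day come out zero-padded, matching the wp:post_date style seen in WordPress’s own exports.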

3) What to Output?!

Since there is no importer for Google Groups data, the script above dumps out WordPress XML. I created an example by making a test blog, writing a few lorem-ipsum type posts, and looking at an export:

<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a WordPress eXtended RSS file generated by WordPress as an export of your blog. -->
<!-- It contains information about your blog's posts, comments, and categories. -->
<!-- You may use this file to transfer that content from one site to another. -->
<!-- This file is not intended to serve as a complete backup of your blog. -->

<!-- To import this information into a WordPress blog follow these steps. -->
<!-- 1. Log into that blog as an administrator. -->
<!-- 2. Go to Manage: Import in the blog's admin panels. -->
<!-- 3. Choose "WordPress" from the list. -->
<!-- 4. Upload this file using the form provided on that page. -->
<!-- 5. You will first be asked to map the authors in this export file to users -->
<!--    on the blog.  For each author, you may choose to map to an -->
<!--    existing user on the blog or to create a new user -->
<!-- 6. WordPress will then import each of the posts, comments, and categories -->
<!--    contained in this file into your blog -->

<!-- generator="WordPress/MU" created="2008-09-23 17:17"-->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:wp="http://wordpress.org/export/1.0/"
>

<channel>
	<title>Dotiketesting's Weblog</title>
	<link>http://dotiketesting.wordpress.com</link>
	<description>Just another WordPress.com weblog</description>
	<pubDate>Sat, 20 Sep 2008 18:32:38 +0000</pubDate>
	<generator>http://wordpress.org/?v=MU</generator>
	<language>en</language>
	<wp:wxr_version>1.0</wp:wxr_version>
	<wp:base_site_url>http://wordpress.com/</wp:base_site_url>
	<wp:base_blog_url>http://dotiketesting.wordpress.com</wp:base_blog_url>
	<wp:category><wp:category_nicename>g-archive</wp:category_nicename><wp:category_parent></wp:category_parent><wp:cat_name><![CDATA[g-archive]]></wp:cat_name></wp:category>
	<wp:category><wp:category_nicename>uncategorized</wp:category_nicename><wp:category_parent></wp:category_parent><wp:cat_name><![CDATA[Uncategorized]]></wp:cat_name></wp:category>
		<item>
<title>ike post title</title>
<link>http://dotiketesting.wordpress.com/?p=705</link>
<pubDate>Thu, 01 Jan 1970 00:00:00 +0000</pubDate>
<dc:creator><![CDATA[dotiketesting]]></dc:creator>

		<category><![CDATA[Uncategorized]]></category>

		<category domain="category" nicename="uncategorized"><![CDATA[Uncategorized]]></category>

<guid isPermaLink="false">http://dotiketesting.wordpress.com/?p=3</guid>
<description></description>
<content:encoded><![CDATA[]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>705</wp:post_id>
<wp:post_date>2008-09-23 17:15:50</wp:post_date>
<wp:post_date_gmt>0000-00-00 00:00:00</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name></wp:post_name>
<wp:status>draft</wp:status>
<wp:post_parent>0</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>post</wp:post_type>
<wp:post_password></wp:post_password>
<wp:postmeta>
<wp:meta_key>_edit_lock</wp:meta_key>
<wp:meta_value>1221936163</wp:meta_value>
</wp:postmeta>
<wp:postmeta>
<wp:meta_key>_edit_last</wp:meta_key>
<wp:meta_value>5158622</wp:meta_value>
</wp:postmeta>
	</item>
<item>
<title>About</title>
<link>http://dotiketesting.wordpress.com/about/</link>
<pubDate>Sat, 20 Sep 2008 18:32:38 +0000</pubDate>
<dc:creator><![CDATA[dotiketesting]]></dc:creator>

<guid isPermaLink="false"></guid>
<description></description>
<content:encoded><![CDATA[This is an example of a WordPress page, you could edit this to put information about yourself or your site so readers know where you are coming from. You can create as many pages like this one or sub-pages as you like and manage all of your content inside of WordPress.]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>2</wp:post_id>
<wp:post_date>2008-09-20 18:32:38</wp:post_date>
<wp:post_date_gmt>2008-09-20 18:32:38</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name>about</wp:post_name>
<wp:status>publish</wp:status>
<wp:post_parent>0</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>page</wp:post_type>
<wp:post_password></wp:post_password>
	</item>
<item>
<title>Hello world!</title>
<link>http://dotiketesting.wordpress.com/2008/09/20/hello-world/</link>
<pubDate>Sat, 20 Sep 2008 18:32:38 +0000</pubDate>
<dc:creator><![CDATA[dotiketesting]]></dc:creator>

		<category><![CDATA[Uncategorized]]></category>

		<category domain="category" nicename="uncategorized"><![CDATA[Uncategorized]]></category>

<guid isPermaLink="false">http://dotiketesting.wordpress.com/2008/09/20/hello-world/</guid>
<description></description>
<content:encoded><![CDATA[Welcome to <a href="http://wordpress.com/">Wordpress.com</a>. This is your first post. Edit or delete it and start blogging!]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>707</wp:post_id>
<wp:post_date>2008-09-20 18:32:38</wp:post_date>
<wp:post_date_gmt>2008-09-20 18:32:38</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name>hello-world</wp:post_name>
<wp:status>publish</wp:status>
<wp:post_parent>0</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>post</wp:post_type>
<wp:post_password></wp:post_password>
<wp:postmeta>
<wp:meta_key>_edit_lock</wp:meta_key>
<wp:meta_value>1222190170</wp:meta_value>
</wp:postmeta>
<wp:postmeta>
<wp:meta_key>_edit_last</wp:meta_key>
<wp:meta_value>5158622</wp:meta_value>
</wp:postmeta>
<wp:comment>
<wp:comment_id>2</wp:comment_id>
<wp:comment_author><![CDATA[Mr WordPress]]></wp:comment_author>
<wp:comment_author_email></wp:comment_author_email>
<wp:comment_author_url>http://wordpress.com/</wp:comment_author_url>
<wp:comment_author_IP>127.0.0.1</wp:comment_author_IP>
<wp:comment_date>2008-09-20 18:32:38</wp:comment_date>
<wp:comment_date_gmt>2008-09-20 18:32:38</wp:comment_date_gmt>
<wp:comment_content><![CDATA[Hi, this is a comment.<br />To delete a comment, just log in, and view the posts' comments, there you will have the option to edit or delete them.]]></wp:comment_content>
<wp:comment_approved>1</wp:comment_approved>
<wp:comment_type></wp:comment_type>
<wp:comment_parent>0</wp:comment_parent>
<wp:comment_user_id>0</wp:comment_user_id>
</wp:comment>
	</item>
<item>
<title>ike post title</title>
<link>http://dotiketesting.wordpress.com/2008/09/20/4-revision/</link>
<pubDate>Sat, 20 Sep 2008 18:43:00 +0000</pubDate>
<dc:creator><![CDATA[dotiketesting]]></dc:creator>

		<category><![CDATA[Uncategorized]]></category>

		<category domain="category" nicename="uncategorized"><![CDATA[Uncategorized]]></category>

<guid isPermaLink="false">http://dotiketesting.wordpress.com/2008/09/20/4-revision/</guid>
<description></description>
<content:encoded><![CDATA[]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>708</wp:post_id>
<wp:post_date>2008-09-20 18:43:00</wp:post_date>
<wp:post_date_gmt>2008-09-20 18:43:00</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name>4-revision</wp:post_name>
<wp:status>inherit</wp:status>
<wp:post_parent>706</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>revision</wp:post_type>
<wp:post_password></wp:post_password>
	</item>
<item>
<title>wordpress-12008-09-20testxml.import</title>
<link>http://dotiketesting.wordpress.com/?attachment_id=6</link>
<pubDate>Sat, 20 Sep 2008 18:49:25 +0000</pubDate>
<dc:creator><![CDATA[dotiketesting]]></dc:creator>

		<category><![CDATA[Uncategorized]]></category>

		<category domain="category" nicename="uncategorized"><![CDATA[Uncategorized]]></category>

<guid isPermaLink="false">http://dotiketesting.files.wordpress.com/2008/09/wordpress-12008-09-20testxml.import</guid>
<description></description>
<content:encoded><![CDATA[http://dotiketesting.files.wordpress.com/2008/09/wordpress-12008-09-20testxml.import]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>6</wp:post_id>
<wp:post_date>2008-09-20 18:49:25</wp:post_date>
<wp:post_date_gmt>2008-09-20 18:49:25</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name>wordpress-12008-09-20testxmlimport</wp:post_name>
<wp:status>inherit</wp:status>
<wp:post_parent>0</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>attachment</wp:post_type>
<wp:post_password></wp:post_password>
<wp:attachment_url>http://dotiketesting.files.wordpress.com/2008/09/wordpress-12008-09-20testxml.import</wp:attachment_url>
<wp:postmeta>
<wp:meta_key>_wp_attached_file</wp:meta_key>
<wp:meta_value>/home/wpcom/public_html/wp-content/blogs.dir/957/4919684/files/2008/09/wordpress-12008-09-20testxml.import</wp:meta_value>
</wp:postmeta>
	</item>
<item>
<title>wordpress-12008-09-20testxml1.import</title>
<link>http://dotiketesting.wordpress.com/?attachment_id=7</link>
<pubDate>Sat, 20 Sep 2008 18:49:40 +0000</pubDate>
<dc:creator><![CDATA[dotiketesting]]></dc:creator>

		<category><![CDATA[Uncategorized]]></category>

		<category domain="category" nicename="uncategorized"><![CDATA[Uncategorized]]></category>

<guid isPermaLink="false">http://dotiketesting.files.wordpress.com/2008/09/wordpress-12008-09-20testxml1.import</guid>
<description></description>
<content:encoded><![CDATA[http://dotiketesting.files.wordpress.com/2008/09/wordpress-12008-09-20testxml1.import]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>7</wp:post_id>
<wp:post_date>2008-09-20 18:49:40</wp:post_date>
<wp:post_date_gmt>2008-09-20 18:49:40</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name>wordpress-12008-09-20testxml1import</wp:post_name>
<wp:status>inherit</wp:status>
<wp:post_parent>0</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>attachment</wp:post_type>
<wp:post_password></wp:post_password>
<wp:attachment_url>http://dotiketesting.files.wordpress.com/2008/09/wordpress-12008-09-20testxml1.import</wp:attachment_url>
<wp:postmeta>
<wp:meta_key>_wp_attached_file</wp:meta_key>
<wp:meta_value>/home/wpcom/public_html/wp-content/blogs.dir/957/4919684/files/2008/09/wordpress-12008-09-20testxml1.import</wp:meta_value>
</wp:postmeta>
	</item>
<item>
<title>wordpress-12008-09-20testxml2.import</title>
<link>http://dotiketesting.wordpress.com/?attachment_id=8</link>
<pubDate>Sat, 20 Sep 2008 18:51:14 +0000</pubDate>
<dc:creator><![CDATA[dotiketesting]]></dc:creator>

		<category><![CDATA[Uncategorized]]></category>

		<category domain="category" nicename="uncategorized"><![CDATA[Uncategorized]]></category>

<guid isPermaLink="false">http://dotiketesting.files.wordpress.com/2008/09/wordpress-12008-09-20testxml2.import</guid>
<description></description>
<content:encoded><![CDATA[http://dotiketesting.files.wordpress.com/2008/09/wordpress-12008-09-20testxml2.import]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>8</wp:post_id>
<wp:post_date>2008-09-20 18:51:14</wp:post_date>
<wp:post_date_gmt>2008-09-20 18:51:14</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name>wordpress-12008-09-20testxml2import</wp:post_name>
<wp:status>inherit</wp:status>
<wp:post_parent>0</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>attachment</wp:post_type>
<wp:post_password></wp:post_password>
<wp:attachment_url>http://dotiketesting.files.wordpress.com/2008/09/wordpress-12008-09-20testxml2.import</wp:attachment_url>
<wp:postmeta>
<wp:meta_key>_wp_attached_file</wp:meta_key>
<wp:meta_value>/home/wpcom/public_html/wp-content/blogs.dir/957/4919684/files/2008/09/wordpress-12008-09-20testxml2.import</wp:meta_value>
</wp:postmeta>
	</item>
<item>
<title>ike post title</title>
<link>http://dotiketesting.wordpress.com/?p=706</link>
<pubDate>Mon, 22 Sep 2008 17:16:25 +0000</pubDate>
<dc:creator><![CDATA[dotiketesting]]></dc:creator>

		<category><![CDATA[g-archive]]></category>

		<category domain="category" nicename="g-archive"><![CDATA[g-archive]]></category>

<guid isPermaLink="false">http://dotiketesting.wordpress.com/?p=4</guid>
<description></description>
<content:encoded><![CDATA[This is the post title.

http://slashdot.org/

More to be found later]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>706</wp:post_id>
<wp:post_date>2008-09-22 17:16:25</wp:post_date>
<wp:post_date_gmt>2008-09-22 17:16:25</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name>ike-post-title</wp:post_name>
<wp:status>draft</wp:status>
<wp:post_parent>0</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>post</wp:post_type>
<wp:post_password></wp:post_password>
<wp:postmeta>
<wp:meta_key>_edit_lock</wp:meta_key>
<wp:meta_value>1222190227</wp:meta_value>
</wp:postmeta>
<wp:postmeta>
<wp:meta_key>_edit_last</wp:meta_key>
<wp:meta_value>5158622</wp:meta_value>
</wp:postmeta>
	</item>
<item>
<title>ike post title</title>
<link>http://dotiketesting.wordpress.com/2008/09/23/706-revision/</link>
<pubDate>Tue, 23 Sep 2008 17:15:50 +0000</pubDate>
<dc:creator><![CDATA[dotiketesting]]></dc:creator>

		<category><![CDATA[Uncategorized]]></category>

		<category domain="category" nicename="uncategorized"><![CDATA[Uncategorized]]></category>

<guid isPermaLink="false">http://dotiketesting.wordpress.com/2008/09/23/706-revision/</guid>
<description></description>
<content:encoded><![CDATA[This is the post title.

http://slashdot.org/

More to be found later]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>709</wp:post_id>
<wp:post_date>2008-09-23 17:15:50</wp:post_date>
<wp:post_date_gmt>2008-09-23 17:15:50</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name>706-revision</wp:post_name>
<wp:status>inherit</wp:status>
<wp:post_parent>706</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>revision</wp:post_type>
<wp:post_password></wp:post_password>
	</item>
<item>
<title>ike post title</title>
<link>http://dotiketesting.wordpress.com/2008/09/23/706-revision-2/</link>
<pubDate>Tue, 23 Sep 2008 17:16:25 +0000</pubDate>
<dc:creator><![CDATA[dotiketesting]]></dc:creator>

		<category><![CDATA[Uncategorized]]></category>

		<category domain="category" nicename="uncategorized"><![CDATA[Uncategorized]]></category>

<guid isPermaLink="false">http://dotiketesting.wordpress.com/2008/09/23/706-revision-2/</guid>
<description></description>
<content:encoded><![CDATA[This is the post title.

http://slashdot.org/

More to be found later]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>710</wp:post_id>
<wp:post_date>2008-09-23 17:16:25</wp:post_date>
<wp:post_date_gmt>2008-09-23 17:16:25</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name>706-revision-2</wp:post_name>
<wp:status>inherit</wp:status>
<wp:post_parent>706</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>revision</wp:post_type>
<wp:post_password></wp:post_password>
	</item>
<item>
<title>ike post title</title>
<link>http://dotiketesting.wordpress.com/2008/09/23/706-revision-3/</link>
<pubDate>Tue, 23 Sep 2008 17:16:46 +0000</pubDate>
<dc:creator><![CDATA[dotiketesting]]></dc:creator>

		<category><![CDATA[Uncategorized]]></category>

		<category domain="category" nicename="uncategorized"><![CDATA[Uncategorized]]></category>

<guid isPermaLink="false">http://dotiketesting.wordpress.com/2008/09/23/706-revision-3/</guid>
<description></description>
<content:encoded><![CDATA[This is the post title.

http://slashdot.org/

More to be found later]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>711</wp:post_id>
<wp:post_date>2008-09-23 17:16:46</wp:post_date>
<wp:post_date_gmt>2008-09-23 17:16:46</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name>706-revision-3</wp:post_name>
<wp:status>inherit</wp:status>
<wp:post_parent>706</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>revision</wp:post_type>
<wp:post_password></wp:post_password>
	</item>
</channel>
</rss>

It's pretty sane once you start working with it. One could get wild with XML libraries to build the output, but as the Python in step 2 above shows, I chose an even less structured route and simply glued the text strings together, rinse and repeat…
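
For anyone who wants to see the string-gluing idea in isolation, here is a rough sketch. It is not the script from step 2, just an illustration: the names `ITEM_TEMPLATE`, `cdata_safe` and `make_item` are made up, the element names are copied from the sample export above, and the `publish`/`post` values are assumptions for a normal published post (the sample items above are revisions). It also assumes the timestamp is already in UTC.

```python
from datetime import datetime
from xml.sax.saxutils import escape

# Hypothetical template; element names are copied from the sample export above.
ITEM_TEMPLATE = """<item>
<title>%(title)s</title>
<pubDate>%(pub_date)s</pubDate>
<dc:creator><![CDATA[%(author)s]]></dc:creator>
<category><![CDATA[Uncategorized]]></category>
<content:encoded><![CDATA[%(body)s]]></content:encoded>
<wp:post_id>%(post_id)d</wp:post_id>
<wp:post_date>%(post_date)s</wp:post_date>
<wp:post_date_gmt>%(post_date)s</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:status>publish</wp:status>
<wp:post_type>post</wp:post_type>
</item>
"""


def cdata_safe(text):
    # A literal "]]>" would close the CDATA section early, so split it in two.
    return text.replace("]]>", "]]]]><![CDATA[>")


def make_item(post_id, title, author, body, when):
    """Glue one message into a WXR-style <item> string (when = UTC datetime)."""
    return ITEM_TEMPLATE % {
        "post_id": post_id,
        "title": escape(title),  # titles are plain XML text, so escape & < >
        "author": cdata_safe(author),
        "body": cdata_safe(body),
        "pub_date": when.strftime("%a, %d %b %Y %H:%M:%S +0000"),
        "post_date": when.strftime("%Y-%m-%d %H:%M:%S"),
    }


if __name__ == "__main__":
    print(make_item(712, "ike post title", "dotiketesting",
                    "This is the post body.", datetime(2008, 9, 23, 17, 15, 50)))
```

The items produced this way still have to be wrapped in the `<rss>` and `<channel>` elements, with the namespace declarations from the top of a real export, before WordPress will accept the file.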

4) Import into WordPress (a test blog first!)

Once the export file was built, I made a test WordPress blog for import testing. A few important things to know when importing:

  • WordPress posts must belong to an author; the Import dialog will ask which user you want to assign the posts to.
  • The author for posts must already be associated with your blog.
  • These two points matter if you are importing large amounts of data (spectre pushed the 15 MB limit upon import…): if you choose the wrong author, you have to change *each post manually*.
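
Since a bad import is tedious to undo, it may be worth sanity-checking the file before uploading it. The following is only a sketch, assuming the file is a complete export with its namespaces declared; the function name `check_export` is made up, and the 15 MB default is simply the limit spectre ran into (yours may differ).

```python
import os
import xml.etree.ElementTree as ET


def check_export(path, limit_bytes=15 * 1024 * 1024):
    """Return (size_in_bytes, under_limit, item_count) for an export file.

    ET.parse raises xml.etree.ElementTree.ParseError when the file is not
    well-formed XML, a likely symptom of a string-gluing mistake.
    """
    size = os.path.getsize(path)
    root = ET.parse(path).getroot()
    # Items live under <rss><channel>, as in the sample export above.
    item_count = len(root.findall("./channel/item"))
    return size, size <= limit_bytes, item_count
```

Comparing the item count against the number of messages you expected to convert is a cheap way to catch dropped posts before they reach the test blog.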

Aside from an upload misfire here and there, the Import feature works great!

And that’s more or less it- aside from a ton of cursing and message parsing hell. There are definitely more elegant (and reusable) ways to do this, but moving a big group from Google Groups to WordPress must not be a common occurrence- (I sure as hell never plan to do it again!)

MISC EXTRAS

Here’s a short Python script to convert Spectre email posts to WordPress-ready posts (paste the output in the ‘html’ pane, save, and continue editing):

#!/usr/bin/python
# -*- coding: utf-8 -*-
#

"""
Copyright (c) 2008, Isaac (.ike) Levy, For Spectre
All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list
of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.
Neither the name of the <ORGANIZATION> nor the names of its contributors may be
used to endorse or promote products derived from this software without specific
prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
"""

import re
import sys

usage = '''
USAGE:
This script creates html urls and scrubs emails in plain text files, e.g.:
     scrub2html.py [file.txt]
     cat [file.txt] | scrub2html.py

The script prints to stdout, to write to file, pipe output to a file:
     scrub2html.py [file.txt] > myFile.txt

--
This script assumes it is being passed a text file with the following:
    - Plain-text URLS in the format:

http://someurl.foo


https://someurl.bar

    - Email addresses in the text file

This script does 2 things:
    1) Converts the urls to html links, e.g.:
    <a href="http://someurl.foo">http://someurl.foo</a>

    2) Obfuscates the email addresses, e.g.:
    tom@asdf.com becomes tom [at] asdf [dot] com
    -and-
    <tim@asdf.com> becomes { tim [at] asdf [dot] com }

This script was originally created to convert plain-text emails to WordPress
ready posts.  Originally created for the Spectre Event Horizon, 2008.

'''

myInfile = ''

#ikenote, FILE HANDLING
if len(sys.argv) > 1:
    # for usage with file as command-line argument:
    # ./this_script.py somefile.txt
    with open(sys.argv[1]) as f:
        myInfile = f.readlines()
elif not sys.stdin.isatty():
    # for usage from stdin, pipes, etc.. example:
    # cat somefile.txt | ./this_script.py
    myInfile = sys.stdin.readlines()

#ikenote, TEXT PROCESSING
if myInfile == '':
    #ikenote, bail out early if no input,
    sys.stdout.write(usage)
    sys.exit()
else:
    for bigLine in myInfile:

        #ikenote: url-ize http(s) links,
        if bigLine.strip().startswith(('http://', 'https://')):
            bigLine = '<a href="' + bigLine.strip() + '">' + bigLine.strip() + '</a>\n'
        #ikenote: crudely obfuscate email addresses,
        #         (this rewrites every '.', '<' and '>' on the matching line)
        elif re.search(r"[\w-][\w.-]+@[\w-][\w.-]+[a-zA-Z]{1,7}", bigLine) is not None:
            bigLine = bigLine.replace('@', ' [at] ').replace('.', ' [dot] ').replace('<', '{ ').replace('>', ' }')

        #ikenote: each line already carries its own newline,
        sys.stdout.write(bigLine)
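
The email step above rewrites every period and angle bracket on any line that contains an address. If that turns out to be too blunt, one possible refinement is to rewrite only the matched address itself. This is a sketch in Python 3 and not part of the original script; the names `EMAIL_RE`, `_scrub` and `obfuscate` are made up, and the pattern is deliberately loose rather than a full address grammar.

```python
import re

# Optional angle brackets around a loose user@domain.tld pattern.
EMAIL_RE = re.compile(r"(<)?([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)(>)?")


def _scrub(match):
    user = match.group(2).replace(".", " [dot] ")
    domain = match.group(3).replace(".", " [dot] ")
    text = user + " [at] " + domain
    if match.group(1) and match.group(4):
        # <addr> becomes { addr }, as in the usage text of the script above
        return "{ " + text + " }"
    # Leave any unpaired bracket exactly as it was.
    return (match.group(1) or "") + text + (match.group(4) or "")


def obfuscate(line):
    """Obfuscate email addresses in a line, leaving other text untouched."""
    return EMAIL_RE.sub(_scrub, line)
```

With this variant, sentence punctuation elsewhere on the line survives, at the cost of a slightly longer regular expression.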

With that, I hope these notes help someone else who’s attempting to do the same thing… Best of luck!

Rocket- .ike { ike [at] lesmuug [dot] org }

