Migrating from WordPress to Astro with OpenAI
I finally got around to moving this blog from WordPress to a more modern solution. This has been a long time in the making. I ended up using Astro as the framework. With a little help from bolt.new, the basic structure of the blog was ready in no time.
The main issue was migrating the existing content from WordPress to Astro. The amount of content was pretty small, and I only wanted to migrate the blog posts themselves. I wanted to keep similar URLs so that I could map the old ones to the new ones with HTTP 301 redirects.
I had already made an XML dump of the WordPress site. The main task was to go through that XML, extract the blog posts, ignore drafts, convert them to Markdown, and create the files.
For a one-off migration I just needed a quick and dirty solution, so I ended up asking ChatGPT for help. I uploaded the XML file and asked it to generate a migration script. ChatGPT figured out the XML structure and picked the right parts. I decided to outsource the conversion of the post content from HTML to Markdown to an OpenAI model via the OpenAI API. This worked out pretty well.
The old URL structure on my site was
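To make the selection step concrete, here is a minimal sketch of filtering a WordPress export down to published posts. The sample XML is hand-written, and the helper name `published_posts` is hypothetical. It also assumes the export marks pages and attachments with `wp:post_type`, which makes this filter stricter than simply skipping drafts.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for a WordPress export (WXR) file
SAMPLE = """<rss xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item><title>Published post</title>
      <wp:post_type>post</wp:post_type><wp:status>publish</wp:status></item>
    <item><title>Unfinished draft</title>
      <wp:post_type>post</wp:post_type><wp:status>draft</wp:status></item>
    <item><title>About page</title>
      <wp:post_type>page</wp:post_type><wp:status>publish</wp:status></item>
  </channel>
</rss>"""

NS = {"wp": "http://wordpress.org/export/1.2/"}


def published_posts(root):
    """Yield only <item> elements that are published blog posts."""
    for item in root.findall("channel/item"):
        post_type = item.findtext("wp:post_type", default="", namespaces=NS)
        status = item.findtext("wp:status", default="", namespaces=NS)
        if post_type == "post" and status == "publish":
            yield item


root = ET.fromstring(SAMPLE)
print([item.findtext("title") for item in published_posts(root)])
# prints ['Published post']
```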
https://myblog.com/{category}/{post-slug}
On the new site I wanted to have
https://mynewblog.com/blog/{post-slug}
The code picked the slug from each post's link in the XML file and wrote the Markdown content to a file named after that slug.
Here is a sample script for the conversion. It ain't pretty, but it got the job done. The only external dependency is the OpenAI Python SDK.
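As a rough illustration of that mapping, the slug can be taken from the last path segment of the old URL and placed under `/blog/`. The function name `old_to_new_url` and the sample post URL are made up for this sketch; the hostnames are the placeholders used above.

```python
from urllib.parse import urlparse

NEW_BASE = "https://mynewblog.com/blog"  # placeholder host from the example above


def old_to_new_url(old_url: str) -> str:
    """Map an old /category/post-slug URL to the new /blog/post-slug URL."""
    # Drop empty segments so a trailing slash does not matter
    segments = [s for s in urlparse(old_url).path.split("/") if s]
    if not segments:
        raise ValueError(f"No slug found in {old_url!r}")
    slug = segments[-1]
    return f"{NEW_BASE}/{slug}"


print(old_to_new_url("https://myblog.com/python/my-first-post/"))
# prints https://mynewblog.com/blog/my-first-post
```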
uv add openai
import os
import xml.etree.ElementTree as ET
from openai import OpenAI
# Configure OpenAI client
client = OpenAI()
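
# Function to build the output filename from the post's permalink
def get_filename_from_link(link):
    if link:
        # Assumes the permalink ends with a trailing slash,
        # so the slug is the second-to-last segment
        return link.split("/")[-2] + ".md"
    return "unknown.md"
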
# Function to convert HTML content to Markdown using OpenAI
def convert_to_markdown(title, content):
    instructions = """
We are migrating a Wordpress blog to new platform. As part of the migration, we need to convert the existing blog posts from HTML to Markdown format.
Here are some instructions to help you with the conversion:
- Never mention that you are AI model or bot.
- The input files may contain [code]..[/code] blocks. These are not supported on Markdown. For markdown code blocks, use triple backticks (```) before and after the code block.
- If possible, try to identify the language of the code block and add it after the triple backticks. For example, ```python
- When the content is displayed, a title is shown for the user separately. The content you generate should not include a first heading
- **Only include Markdown headings (`#`, `##`, etc.) if the HTML content explicitly contains heading tags like `<h1>`, `<h2>`, etc. Do not infer headings from bold text, lists, or structure.**
"""
    prompt = f"Convert the following Wordpress content to Markdown\n{content}"
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "developer", "content": instructions},
                {"role": "user", "content": prompt}
            ],
            temperature=0.0,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error during conversion: {e}")
        return content

# Parse the XML file
def process_wordpress_xml(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    # Namespaces in the XML
    ns = {
        'content': 'http://purl.org/rss/1.0/modules/content/',
        'wp': 'http://wordpress.org/export/1.2/'
    }
    # Output directory for Markdown files
    output_dir = "markdown_posts"
    os.makedirs(output_dir, exist_ok=True)
    # Iterate through each item (blog post)
    for item in root.findall("channel/item"):
        # Check if post is a draft
        status = item.find("wp:status", ns).text
        if status == "draft":
            continue
        title = item.find("title").text or "Untitled"
        print(f"Processing: {title}")
        link = item.find("link").text
        pub_date = item.find("wp:post_date", ns).text or "0000-00-00"
        pub_date = pub_date.split(" ")[0]
        content_encoded = item.find("content:encoded", ns).text or ""
        # Extract categories
        categories = []
        for category in item.findall("category[@domain='category']"):
            if category.get('nicename'):
                categories.append(category.get('nicename').strip())
        # Remove duplicates
        categories = list(set(categories))
        # Convert HTML content to Markdown
        markdown_content = convert_to_markdown(title, content_encoded)
        # markdown_content = "dummy"
        # Format the Markdown content with frontmatter
        # (the lines inside the f-string stay flush left so the file starts with ---)
        markdown_output = f"""---
layout: ../../layouts/BlogPostLayout.astro
title: {title}
date: {pub_date}
tags: {categories}
excerpt: ''
---
{markdown_content}
"""
        # Determine output filename
        filename = get_filename_from_link(link)
        filepath = os.path.join(output_dir, filename)
        # Save to a Markdown file
        with open(filepath, "w", encoding="utf-8") as file:
            file.write(markdown_output)
        print(f"Saved: {filepath}")

# Input XML file
xml_file = "WordPress-export.xml"
# Process the file
process_wordpress_xml(xml_file)
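One caveat with the frontmatter in the script: the title is written unquoted, so a title containing a colon or a leading quote may not parse as YAML. A possible hardening, sketched here with a hypothetical helper `build_frontmatter`, is to serialize the values with `json.dumps`, since a JSON string or array is also valid YAML flow syntax. This is a suggestion for tightening the script, not something the migration above actually did.

```python
import json


def build_frontmatter(title: str, pub_date: str, categories: list[str]) -> str:
    """Return a YAML frontmatter block with safely quoted values."""
    lines = [
        "---",
        "layout: ../../layouts/BlogPostLayout.astro",
        # json.dumps adds double quotes and escapes embedded quotes
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"date: {pub_date}",
        # Sorted so repeated runs produce identical files
        f"tags: {json.dumps(sorted(set(categories)), ensure_ascii=False)}",
        "excerpt: ''",
        "---",
    ]
    return "\n".join(lines) + "\n"


print(build_frontmatter('Astro: a "modern" stack', "2025-01-15", ["web", "astro", "web"]))
```

The returned block could replace the hand-written `---` section of the f-string, with the converted Markdown appended after it.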
I also moved the old blog domain to Cloudflare and used Cloudflare redirects to map the old URLs to the new ones. This way the old links keep working.
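The list of redirects can be generated from the same old URLs. The sketch below builds `source,target,status` rows and writes them as CSV. The column layout, the helper name `redirect_rows`, and the sample URLs are assumptions for illustration, so the output would need adapting to whatever import format the redirect tool expects.

```python
import csv
import io

NEW_BASE = "https://mynewblog.com/blog"  # placeholder host from the post


def redirect_rows(old_urls):
    """Return (old_url, new_url, 301) tuples for each old permalink."""
    rows = []
    for old_url in old_urls:
        # The slug is the last non-empty path segment of the old permalink
        slug = old_url.rstrip("/").rsplit("/", 1)[-1]
        rows.append((old_url, f"{NEW_BASE}/{slug}", 301))
    return rows


old_urls = [
    "https://myblog.com/python/my-first-post/",
    "https://myblog.com/devops/moving-to-astro/",
]

# Write to an in-memory buffer here; a real run would write to a file
buffer = io.StringIO()
csv.writer(buffer).writerows(redirect_rows(old_urls))
print(buffer.getvalue())
```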