Confluence to Obsidian

confluence to obsidian

Overview

I migrated my notes from Confluence to Obsidian using a few scripts. More generically, this converts Confluence to markdown. This blog post describes the general process so others who face this problem have a starting point.

Confluence doesn’t directly support exporting a workspace to markdown. The result of my attempt was only a partial solution.

I prioritized the following:

Preserve any text written and convert it to markdown
Avoid excessive HTML. Try to keep everything plain markdown

The following weren’t converted to markdown:

Folder structure
Images
Attachments
Internal links between notebooks

The result was a massive folder with each page as a markdown file which could be copied and searched.

How it works

Export a Confluence workspace to HTML.
Convert the HTML to markdown using a Pandoc in a PowerShell script.

Due to the the HTML format, not everything can get easily converted. Use a Lua filter to define how Pandoc should convert different HTML elements.

PowerShell script

# ConvetHTMLToMarkdown.ps1

# Set the path to the folder containing the HTML files
$sourceFolder = "./confluence_workspace"

# Set the path to the folder containing generated Markdown files
$destinationFolder = "./markdown"

# Set the path to the lua filter script
$luaFilterPath = "filter_html.lua"

# Get all HTML files in the folder
$htmlFiles = Get-ChildItem -Path $sourceFolder -Filter "*.html" -File

# Loop through each HTML file and convert to Markdown using Pandoc
# foreach ($htmlFile in $htmlFiles[0..10]) {
foreach ($htmlFile in $htmlFiles) {
    $htmlFilePath = $htmlFile.FullName
    $markdownFileName = Join-Path -Path $destinationFolder -ChildPath ([System.IO.Path]::ChangeExtension($htmlFile.Name, "md"))

    $command = "pandoc '$htmlFilePath' -f html -t gfm -L '$luaFilterPath' -s -o '$markdownFileName'"
    Invoke-Expression $command
}

This is in PowerShell but should be similar in bash.
The script sets all of the parameters and file paths
It loops through all of the HTML files and runs the Pandoc command to convert the file to markdown
Convert a few files first to check the results before converting the entire workspace.

Pandoc

pandoc '$htmlFilePath' -f html -t gfm -L '$luaFilterPath' -s -o '$markdownFileName'

Install Pandoc. https://pandoc.org/installing.html
The command converts from HTML to GitHub Flavored Markdown using the Lua filter.
GitHub Flavored Markdown converted the most from Confluence HTML. It was the only format I found which could convert the Confluence HTML tables.

Lua Filter

-- filter_html.lua

function Span(el)
  return el.content
end

function Div(el)
  -- Removes footer
  if el.identifier == "footer" then
    return {}
  end
  -- Removes breadcrumb header
  if el.identifier == "breadcrumb-section" then
    return {}
  end
  -- Removes attachments plugin
  if el.classes:includes("plugin_attachments_container") then
    return {}
  end
  return el.content
end

function Para(el)
  if el.content then
    for _, item in ipairs(el.content) do
      -- Out of order execution, converted links should not be stringified
      if item.t == "RawInline" then
        return el
      end
    end
    return pandoc.utils.stringify(el.content)
  end
end

function Image(el)
  return {}
end

function Link(el)
  -- if el.classes:find("download-all-link", 0) then
  --   return {}
  -- end
  if el.classes:includes("external-link") then
    link_text = pandoc.utils.stringify(el.content)
    link_url = el.target
    markdown_link = string.format("[%s](%s)", link_text, link_url)

    return {pandoc.RawInline("gfm", markdown_link)}
  end
end

-- function Table(el)
--   if el.classes:find("attachments", 0) then
--     return {}
--   end
-- end