@kimadactyl
Last active August 10, 2021 17:09
Scripts to migrate an existing HTML site to Jekyll when you can't access the source
# First grab the remote site
# wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://yourdomain.co.uk
# If you have images hosted on another site then do
# wget --mirror --convert-links --adjust-extension --page-requisites --span-hosts --domains mydomain.com,imagehostingcdn.com https://mydomain.com
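# Then check how many lines of shared header and footer markup to trim from the beginning and end of each page.
# Run this from inside the mirrored site's directory: the glob is relative so it never touches files elsewhere on disk.
htmls = Dir.glob('./**/*.html')
htmls.each do |file|
  # Change these numbers to how many lines you want to trim: 450 from the top, 72 from the bottom
  output = File.readlines(file).drop(450).reverse.drop(72).reverse
  File.write(file, output.join)
end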
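# To help pick those two numbers, here is a minimal sketch (not part of the original scripts) that counts
# how many lines every page shares at the top and at the bottom. The helper names are hypothetical.
# Pages whose headers differ early (for example in <title>) will report a small number, so treat the
# result as a lower bound and still check a page or two by eye.

```ruby
# Count the leading lines that are identical across every page.
# `pages` is an array of pages, each page being an array of lines.
def common_prefix_length(pages)
  return 0 if pages.empty?

  shortest = pages.map(&:length).min
  (0...shortest).take_while do |i|
    pages.all? { |lines| lines[i] == pages.first[i] }
  end.length
end

# Count the trailing identical lines by comparing the pages back to front.
def common_suffix_length(pages)
  common_prefix_length(pages.map(&:reverse))
end

# Usage, from inside the mirrored site's directory:
#   pages = Dir.glob('./**/*.html').map { |file| File.readlines(file) }
#   puts "shared header lines: #{common_prefix_length(pages)}"
#   puts "shared footer lines: #{common_suffix_length(pages)}"
```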
# Add front matter to all your HTML files (run from inside the mirrored site's directory)
htmls = Dir.glob('./**/*.html')
htmls.each do |file|
  # Prepend an empty title field; the blank line separates the front matter from the page body
  File.write(file, "---\ntitle: \n---\n\n" + File.read(file))
end
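# The title above is left blank to fill in by hand. As a rough sketch of one alternative, assuming each
# page still has its <title> element (so the titles must be collected before the trimming step removes
# the header), the title could be read from the HTML. Both helper names are hypothetical, the regex is
# a shortcut rather than a real HTML parser, and HTML entities are not decoded.

```ruby
require 'yaml'

# Return the text of the first <title> element with whitespace collapsed,
# or nil when the page has none.
def extract_title(html)
  match = html.match(%r{<title[^>]*>(.*?)</title>}mi)
  match && match[1].strip.gsub(/\s+/, ' ')
end

# Build the front matter block. to_yaml quotes titles containing colons
# or other characters that YAML would otherwise misread.
def front_matter(title)
  { 'title' => title.to_s }.to_yaml + "---\n\n"
end

# Usage:
#   titles = htmls.to_h { |file| [file, extract_title(File.read(file))] } # before trimming
#   ...trim the headers and footers...
#   htmls.each { |file| File.write(file, front_matter(titles[file]) + File.read(file)) }
```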
# Extract links on external CDNs
htmls = Dir.glob('./**/*.html')
htmls.each do |file|
  # [^"]* keeps each match inside one quoted attribute; a greedy .* can swallow several URLs on the same line
  matches = File.read(file)
                .scan(%r{(http://cdn[0-9]\.yourdomain\.co\.uk/[^"]*\.(jpg|png|gif|jpeg|pdf))"})
  matches.each { |m| puts m[0] }
end
# This isn't perfect, but you can then save the printed URLs to a file (called file-manifest here) and pass it to wget.
# wget -i file-manifest -m
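# A minimal sketch of writing that manifest straight from Ruby, with duplicates removed so wget fetches
# each file once. The constant and helper names are hypothetical, and the pattern reuses the placeholder
# domain from the scan above.

```ruby
# Placeholder CDN pattern; swap in your own domain.
CDN_URL = %r{(http://cdn[0-9]\.yourdomain\.co\.uk/[^"\s]*\.(?:jpg|png|gif|jpeg|pdf))"}

# Return every CDN URL found in one HTML string.
def cdn_urls(html)
  html.scan(CDN_URL).flatten
end

# Collect the URLs from all documents, dropping duplicates and sorting them.
def build_manifest(documents)
  documents.flat_map { |html| cdn_urls(html) }.uniq.sort
end

# Usage (writes file-manifest in the current directory):
#   pages = Dir.glob('./**/*.html').map { |file| File.read(file) }
#   File.write('file-manifest', build_manifest(pages).join("\n") + "\n")
```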
# If you're importing from Dropbox, it might help to run the following shell commands to strip the ?raw=1 and ?dl=0 suffixes from file names
find . -exec rename -s '?raw=1' '' {} +
find . -exec rename -s '?dl=0' '' {} +
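# If the rename command isn't available, the same cleanup can be sketched in Ruby. This is an assumption-laden
# stand-in rather than an exact equivalent: it only renames regular files, only strips the marker when it
# is at the very end of the name, and skips a file if the cleaned name already exists. The helper names
# are hypothetical.

```ruby
DROPBOX_SUFFIXES = ['?raw=1', '?dl=0'].freeze

# Remove a trailing Dropbox query string from a file name, if present.
def strip_dropbox_suffix(name)
  DROPBOX_SUFFIXES.reduce(name) { |result, suffix| result.delete_suffix(suffix) }
end

# Rename every regular file under `root` whose name ends in one of the suffixes.
def rename_dropbox_files(root)
  Dir.glob('**/*', base: root).each do |relative|
    path = File.join(root, relative)
    next unless File.file?(path)

    cleaned = File.join(File.dirname(path), strip_dropbox_suffix(File.basename(path)))
    File.rename(path, cleaned) unless cleaned == path || File.exist?(cleaned)
  end
end

# Usage, from inside the mirrored site's directory:
#   rename_dropbox_files('.')
```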