Brady Jiang bradyjiang

replybot.io
United States

bradyjiang / 20190720-bs4.py

Created July 21, 2019 15:52

Solution 2: BeautifulSoup

	from bs4 import BeautifulSoup

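	# 20190720, from: https://stackoverflow.com/questions/30565404/remove-all-style-scripts-and-html-tags-from-an-html-page
	# str_html is the HTML source string to clean; name the parser explicitly to avoid bs4's parser-guessing warning.
	soup = BeautifulSoup(str_html, "html.parser")
	# Remove every <head> element together with its contents.
	for s in soup(["head"]):
		s.decompose()
	cleaned_html = str(soup)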
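
The snippet above leaves `str_html` undefined, so here is a minimal, self-contained sketch of the same idea. The function name `strip_tags`, its `tag_names` parameter, and the sample strings are hypothetical and not part of the original gist; the parser choice (`html.parser`) is also an assumption.

```python
from bs4 import BeautifulSoup


def strip_tags(str_html, tag_names=("head",)):
    """Return str_html with every element named in tag_names removed.

    Hypothetical helper wrapping the gist's loop; not part of the original.
    """
    soup = BeautifulSoup(str_html, "html.parser")
    # Calling soup(...) is shorthand for soup.find_all(...).
    for element in soup(list(tag_names)):
        element.decompose()  # drop the element and everything inside it
    return str(soup)


# Default behaviour: drop <head>, as in the gist.
sample = "<html><head><title>t</title></head><body><p>Hello</p></body></html>"
cleaned = strip_tags(sample)

# The same loop can target other tags, for example scripts and styles.
sample2 = "<body><style>p{}</style><p>Hi</p><script>alert(1)</script></body>"
no_scripts = strip_tags(sample2, ("script", "style"))
```

Passing tag names that can nest inside each other (for example `head` together with `style`) may try to decompose an element twice, so this sketch keeps the names in each call disjoint.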

bradyjiang / cleaner.py

Created July 21, 2019 15:47

Solution 1: lxml.html.clean.Cleaner

	from lxml.html.clean import Cleaner

	# Leave page_structure off so that Cleaner does not replace <html> with <div>; see:
	# http://stackoverflow.com/questions/15556391/lxml-clean-html-replaces-html-tag-with-div
	cleaner = Cleaner(page_structure=False)
	# Per http://stackoverflow.com/questions/8554035/remove-all-javascript-tags-and-style-tags-from-html-with-python-and-the-lxml-mod
	# Cleaner is a better general solution than strip_elements: besides <script> tags,
	# it also strips things like onclick=function() attributes on other tags.
	cleaner.javascript = True
	cleaner.scripts = True
	# Turn this on in the future if necessary.
	# cleaner.style = True
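
The gist configures the `Cleaner` but never applies it. A minimal sketch of the missing step, assuming `clean_html` is called on a plain HTML string; the `sample` input is hypothetical. The fallback import is also an assumption about the environment: recent lxml releases ship the cleaner as the separate `lxml_html_clean` package, while older ones bundle it as `lxml.html.clean`.

```python
try:
    # Newer lxml releases: the cleaner lives in its own package.
    from lxml_html_clean import Cleaner
except ImportError:
    # Older lxml releases: the cleaner is bundled with lxml itself.
    from lxml.html.clean import Cleaner

# Same configuration as in the gist.
cleaner = Cleaner(page_structure=False)
cleaner.javascript = True
cleaner.scripts = True

# Hypothetical input with both a <script> tag and an inline event handler.
sample = (
    "<html><body>"
    '<p onclick="alert(1)">Hello</p>'
    "<script>alert(2)</script>"
    "</body></html>"
)

# clean_html accepts a string and returns the cleaned markup as a string.
cleaned = cleaner.clean_html(sample)
print(cleaned)
```

The exact serialization of `cleaned` can vary between lxml versions, so it is safer to check for the absence of `<script` and `onclick` than to compare against a fixed string.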