The BeautifulSoup method prettify() formats your HTML for human readability. The problem is that prettify() indents with only a single space per nesting level, which still leaves the HTML a little hard to read. First I tried to see if the formatter parameter would help, but a formatter function receives each string before it has been indented, so it has no information on how deep the indent should be.
Formatting HTML for Humans
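from bs4 import BeautifulSoup

def indent(text):
    # Debug helper: print the number of leading spaces on each string the formatter sees
    print('LS:', len(text) - len(text.lstrip(' ')), text)
    input('---')  # pause after each string
    return text

# raw_html is the HTML string to format; naming the parser avoids a bs4 warning
tree_soup = BeautifulSoup(raw_html, 'html.parser')
pretty_html = tree_soup.prettify(formatter=indent)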
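So I converted the prettified HTML to a StringIO object so I could read it line by line. Then I counted the leading spaces on each line. Since prettify() uses one space per level, that count is the nesting depth, and each line already carries that many spaces, so prepending three more per level gives a four-space indent.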
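from io import StringIO
from bs4 import BeautifulSoup

tree_soup = BeautifulSoup(raw_html, 'html.parser')
pretty_html = tree_soup.prettify()
htmlio = StringIO(pretty_html)
beautiful_html = ''
for line in htmlio.readlines():
    count = len(line) - len(line.lstrip(' '))  # count leading spaces
    # the line already has `count` spaces, so add 3 * count more for 4 per level
    beautiful_html += (count * 4 - count) * ' ' + line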
The ‘beautiful_html’ variable now contains the prettified HTML with a four-space indent!
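If you need this in more than one place, the same leading-space trick can be wrapped in a small helper. This is only a sketch: the function name widen_indent and its width parameter are made up for illustration, and it assumes the input is already prettified with one space per nesting level. It uses only plain string methods, so it works on any string shaped like prettify() output.

```python
def widen_indent(pretty_html, width=4):
    """Scale one-space-per-level indentation to `width` spaces per level.

    Hypothetical helper: assumes each leading space in the input
    represents exactly one nesting level.
    """
    widened = []
    # splitlines() drops line endings, so a trailing newline is not preserved
    for line in pretty_html.splitlines():
        stripped = line.lstrip(' ')
        depth = len(line) - len(stripped)  # leading spaces == nesting depth
        widened.append(' ' * (depth * width) + stripped)
    return '\n'.join(widened)


# Stand-in for the string returned by tree_soup.prettify()
sample = "<html>\n <body>\n  <p>\n   Hello\n  </p>\n </body>\n</html>"
print(widen_indent(sample))
```

Passing a different width, for example widen_indent(sample, width=2), gives a two-space indent instead.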