Parsing¶

The HTML class parses a string of HTML and provides methods to query the DOM for specific elements.

__init__¶

Creates an HTML object from a str or bytes.

from minestrone import HTML
html = HTML("""
<html>
  <head>
    <title>The Dormouse's Story</title>
  </head>
  <body>
    <h1>The Dormouse's Story</h1>

    <ul>
      <li><a href="http://example.com/elsie" class="sister" id="elsie">Elsie</a></li>
      <li><a href="http://example.com/lacie" class="sister" id="lacie">Lacie</a></li>
    </ul>
  </body>
</html>
""")

If closing tags are missing, then they will be added as needed to make the HTML valid.

from minestrone import HTML
assert str(HTML("<span>dormouse")) == "<span>dormouse</span>"

parser¶

minestrone supports three parsers, each with different trade-offs. By default, Python's built-in, pure-Python html.parser is used. lxml can be used for faster parsing. html5lib is another option when a valid HTML5 document is required.

Note

lxml and html5lib are not installed with minestrone by default; install them explicitly with the matching extra.

poetry add minestrone[lxml] or pip install minestrone[lxml]
poetry add minestrone[html5] or pip install minestrone[html5]

Note

Beautiful Soup has a summary table of the three parsers, as well as a more detailed breakdown of the differences between them.
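
Because lxml and html5lib are optional, code that may run without them can check which backends are importable before choosing a parser. The following is a minimal sketch using only the standard library. The helper name `available_parser_names` is hypothetical and not part of minestrone, and mapping the returned names onto `Parser.HTML`, `Parser.LXML`, or `Parser.HTML5` is left to the caller.

```python
import importlib.util


def available_parser_names():
    """Return the parser backends importable in this environment.

    Hypothetical helper, not part of minestrone. "html.parser" ships with
    Python, so it is always listed; the others depend on optional installs.
    """
    names = ["html.parser"]
    for module_name in ("lxml", "html5lib"):
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(module_name) is not None:
            names.append(module_name)
    return names


print(available_parser_names())
```

The check only looks for the module and does not import it, so it is cheap to run at start-up.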

Parser.HTML¶

from minestrone import HTML, Parser
assert str(HTML("<span>dormouse", parser=Parser.HTML)) == "<span>dormouse</span>"

Parser.LXML¶

from minestrone import HTML, Parser
assert str(HTML("<span>dormouse", parser=Parser.LXML)) == "<html><body><span>dormouse</span></body></html>"

Parser.HTML5¶

from minestrone import HTML, Parser
assert str(HTML("<span>dormouse", parser=Parser.HTML5)) == "<html><head></head><body><span>dormouse</span></body></html>"

encoding¶

Beautiful Soup attempts to detect the encoding of the HTML, but its guess is not always correct. An encoding can be passed explicitly if necessary.

from minestrone import HTML
html_bytes = b"<h1>\xed\xe5\xec\xf9</h1>"

# Without a hint, the encoding is guessed as Big5.
assert str(HTML(html_bytes)) == "<h1>翴檛</h1>"
assert HTML(html_bytes).encoding == "big5"

# Passing the encoding explicitly overrides the guess.
assert str(HTML(html_bytes, encoding="iso-8859-8")) == "<h1>םולש</h1>"
assert HTML(html_bytes, encoding="iso-8859-8").encoding == "iso-8859-8"
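
The two results above come from the same four bytes read under different encodings. A minimal standard-library sketch, independent of minestrone, shows why a guessed encoding can be wrong: both decodings succeed, so nothing in the bytes themselves says which one was intended.

```python
raw = b"\xed\xe5\xec\xf9"

# In Big5, each pair of bytes here is read as one character ...
as_big5 = raw.decode("big5")
# ... while in ISO-8859-8 each single byte is one Hebrew letter.
as_hebrew = raw.decode("iso-8859-8")

print(len(as_big5), len(as_hebrew))
print(as_big5 != as_hebrew)
```

When the source of the bytes is known (for example, from an HTTP header or a file format), passing that encoding avoids relying on the guess.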

prettify¶

Returns a prettified version of the HTML.

from minestrone import HTML
html = HTML("""
<html>
<head>
<title>The Dormouse's Story</title>
</head>
<body>
<h1>The Dormouse's Story</h1>

<ul>
<li><a href="http://example.com/elsie" class="sister" id="elsie">Elsie</a></li>
<li><a href="http://example.com/lacie" class="sister" id="lacie">Lacie</a></li>
</ul>
</body>
</html>
""")

assert html.prettify() == """<html>
  <head>
    <title>The Dormouse's Story</title>
  </head>
  <body>
    <h1>The Dormouse's Story</h1>
    <ul>
      <li>
        <a href="http://example.com/elsie" class="sister" id="elsie">Elsie</a>
      </li>
      <li>
        <a href="http://example.com/lacie" class="sister" id="lacie">Lacie</a>
      </li>
    </ul>
  </body>
</html>
"""

__str__¶

Returns the HTML object as a string.

from minestrone import HTML
html = HTML("""
<html>
  <head>
    <title>The Dormouse's Story</title>
  </head>
  <body>
    <h1>The Dormouse's Story</h1>

    <ul>
      <li><a href="http://example.com/elsie" class="sister" id="elsie">Elsie</a></li>
      <li><a href="http://example.com/lacie" class="sister" id="lacie">Lacie</a></li>
    </ul>
  </body>
</html>
""")

assert str(html) == """<html>
<head>
<title>The Dormouse's Story</title>
</head>
<body>
<h1>The Dormouse's Story</h1>
<ul>
<li><a href="http://example.com/elsie" class="sister" id="elsie">Elsie</a></li>
<li><a href="http://example.com/lacie" class="sister" id="lacie">Lacie</a></li>
</ul>
</body>
</html>"""

Note

Rendering the HTML to a string removes the leading whitespace (indentation) on each line.
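
When a test compares `str(html)` against an indented literal, one option is to normalize the expected text the same way first. This is a rough sketch, assuming the only differences are indentation and blank lines as in the example above. `strip_indentation` is a hypothetical helper that imitates the output shown; it is not part of minestrone and not how minestrone renders HTML.

```python
def strip_indentation(markup):
    """Drop blank lines and the surrounding whitespace on each line.

    Hypothetical helper that imitates the rendered output shown in the
    example; it does not parse the HTML.
    """
    lines = (line.strip() for line in markup.splitlines())
    return "\n".join(line for line in lines if line)


expected = strip_indentation("""
<ul>
  <li>Elsie</li>

  <li>Lacie</li>
</ul>
""")
print(expected)
```

Because the helper works line by line on plain text, it should only be applied to markup where whitespace is not significant.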
