Differences between revisions 3 and 9 (spanning 6 versions)

Python XML DOM Minidom

xml.dom.minidom is a module for parsing XML.

This parser uses a minimal implementation of the DOM (Document Object Model) and therefore offers a DOM-like API.

Contents

Python XML DOM Minidom
1. Usage
2. See also

Usage

Parse a file

from xml.dom import minidom
document = minidom.parse(filename)
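
As a minimal sketch of what the parsed document offers, the example below uses minidom.parseString on an inline string instead of a file. The sample XML and the variable names are assumptions made for illustration only.

```python
from xml.dom import minidom

# Hypothetical sample input; minidom.parse(filename) returns the same kind of object.
xml_text = '<catalog><book id="b1">Dune</book><book id="b2">Emma</book></catalog>'
document = minidom.parseString(xml_text)

root = document.documentElement            # the top-level <catalog> element
books = root.getElementsByTagName("book")  # every <book> element below the root

# Map each book's id attribute to the text held by its first child node.
titles = {book.getAttribute("id"): book.firstChild.data for book in books}
print(root.tagName, titles)
```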

If the XML file uses namespaces, it can be easier to disable namespace processing in the parser.

from xml.dom import minidom, expatbuilder
document = expatbuilder.parse(filename, False)
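
To illustrate what the flag changes, here is a small sketch that parses the same string with and without namespace processing. It uses expatbuilder.parseString rather than a file, and the sample XML with its urn:example namespace is a made-up input.

```python
from xml.dom import expatbuilder

# Hypothetical sample input with a prefixed namespace.
xml_text = '<ex:catalog xmlns:ex="urn:example"><ex:book/></ex:catalog>'

with_ns = expatbuilder.parseString(xml_text)            # namespace processing on (default)
without_ns = expatbuilder.parseString(xml_text, False)  # namespace processing off

ns_root = with_ns.documentElement
plain_root = without_ns.documentElement

# The tag name keeps its prefix either way; only the resolved namespace differs.
print(ns_root.tagName, ns_root.namespaceURI)
print(plain_root.tagName, plain_root.namespaceURI)
```

With namespace processing off, elements carry no namespace URI and the xmlns:ex declaration is treated as an ordinary attribute.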

Traverse all nodes

def recurse_print(node):
    if node.nodeType == minidom.Node.TEXT_NODE:
        print(node.data)
    else:
        for child in node.childNodes:
            recurse_print(child)

recurse_print(document)
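
The same recursive pattern can collect structure instead of text. The sketch below builds an indented outline of element names; the function name element_outline and the sample input are hypothetical.

```python
from xml.dom import minidom

def element_outline(node, depth=0):
    """Return one indented line per element at or below node."""
    lines = []
    if node.nodeType == minidom.Node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        depth += 1  # children of an element are indented one level deeper
    for child in node.childNodes:
        lines.extend(element_outline(child, depth))
    return lines

document = minidom.parseString("<a><b><c/></b><d/></a>")
outline = element_outline(document)
print("\n".join(outline))
```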

Scrape HTML tables

def recurse_text(node):
    buffer = ""
    for child in node.childNodes:
        if child.nodeType == minidom.Node.TEXT_NODE:
            buffer += child.data
        else:
            buffer += recurse_text(child)
    return buffer

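# Rows are collected here; each row is a list of cell strings.
data = []
for table in document.getElementsByTagName("table"):
    for row in table.getElementsByTagName("tr"):
        data.append([])
        # Header cells always go into the first row, which assumes a single
        # table whose header row comes first.
        for header_cell in row.getElementsByTagName("th"):
            data[0].append(recurse_text(header_cell))
        for cell in row.getElementsByTagName("td"):
            data[-1].append(recurse_text(cell))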
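
One possible variant, sketched below, keeps each table separate and returns the result instead of filling a shared list. The function name scrape_tables and the sample markup are assumptions; because minidom is an XML parser, the input must be well-formed (XHTML-style) markup.

```python
from xml.dom import minidom

def recurse_text(node):
    """Concatenate the text of node and all of its descendants."""
    buffer = ""
    for child in node.childNodes:
        if child.nodeType == minidom.Node.TEXT_NODE:
            buffer += child.data
        else:
            buffer += recurse_text(child)
    return buffer

def scrape_tables(document):
    """Return a list of tables, each a list of rows, each a list of cell strings."""
    tables = []
    for table in document.getElementsByTagName("table"):
        rows = []
        for row in table.getElementsByTagName("tr"):
            # Look only at direct children so th and td cells keep document order.
            cells = [
                recurse_text(child).strip()
                for child in row.childNodes
                if child.nodeType == minidom.Node.ELEMENT_NODE
                and child.tagName in ("th", "td")
            ]
            rows.append(cells)
        tables.append(rows)
    return tables

# Hypothetical sample input.
html = """<html><body><table>
<tr><th>Name</th><th>Qty</th></tr>
<tr><td>Apple</td><td><b>3</b></td></tr>
</table></body></html>"""
tables = scrape_tables(minidom.parseString(html))
print(tables)
```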

Scrub the DOM

It can be useful to scrub the DOM of components that are not needed.

To remove attributes, try:

if node.hasAttribute("hidden"):
    node.removeAttribute("hidden")
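
To apply the same check to a whole document rather than a single node, one option is to walk every element. This is only a sketch on a made-up input; it relies on getElementsByTagName("*") matching all elements.

```python
from xml.dom import minidom

# Hypothetical sample input.
document = minidom.parseString(
    '<ul><li hidden="">a</li><li>b</li><li hidden="">c</li></ul>'
)

# "*" matches every element below the node the method is called on.
for element in document.getElementsByTagName("*"):
    if element.hasAttribute("hidden"):
        element.removeAttribute("hidden")

result = document.documentElement.toxml()
print(result)
```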

To remove nodes, try:

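# Iterate over a copy, because removeChild changes node.childNodes.
for child in list(node.childNodes):
    # Only elements have attributes; text and comment nodes do not.
    if child.nodeType == minidom.Node.ELEMENT_NODE and child.hasAttribute("hidden"):
        node.removeChild(child)
        child.unlink()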
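
Removal can also be applied recursively to a whole tree. The sketch below assumes a helper named remove_hidden (a hypothetical name) and a made-up input; it iterates over a copy of the child list and skips non-element nodes, which have no attributes.

```python
from xml.dom import minidom

def remove_hidden(node):
    """Remove every descendant element of node that has a hidden attribute."""
    # Iterate over a copy, because removeChild mutates node.childNodes.
    for child in list(node.childNodes):
        if child.nodeType != minidom.Node.ELEMENT_NODE:
            continue  # text and comment nodes have no attributes
        if child.hasAttribute("hidden"):
            node.removeChild(child)
            child.unlink()  # release the removed subtree's internal references
        else:
            remove_hidden(child)

# Hypothetical sample input.
document = minidom.parseString(
    '<div><p hidden="">x</p><p>y<span hidden="">z</span></p><p hidden="">w</p></div>'
)
remove_hidden(document)
result = document.documentElement.toxml()
print(result)
```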

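To replace nodes, for example with comments, try:

for child in list(node.childNodes):
    if child.nodeType == minidom.Node.ELEMENT_NODE and child.hasAttribute("hidden"):
        # alternatively, createTextNode or createElement
        replacement = document.createComment("scrubbed useless node")
        # replaceChild takes the new node first, then the node to replace
        node.replaceChild(replacement, child)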

Revision 3 as of 2022-05-10 18:39:41 (size: 1893, editor: DominicRicottone)
Revision 9 as of 2023-04-11 14:36:55 (size: 2287, editor: DominicRicottone)
See also

Python xml.dom.minidom module documentation: https://docs.python.org/3/library/xml.dom.minidom.html
