Python XML DOM Minidom
xml.dom.minidom is a Python standard library module for parsing XML.
It is a minimal implementation of the DOM (Document Object Model), and as such offers a DOM-like API.
Usage
Parse a file
    from xml.dom import minidom

    document = minidom.parse(filename)
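When the XML is already in memory, minidom.parseString can be used in place of minidom.parse. The following is a minimal sketch; the sample XML and the names XML_TEXT, root, and ids are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical sample input, used only for this illustration.
XML_TEXT = '<inventory><item id="1">apple</item><item id="2">pear</item></inventory>'

# parseString accepts the XML as a string instead of a filename.
document = minidom.parseString(XML_TEXT)

# documentElement is the root element of the parsed document.
root = document.documentElement

# Collect the "id" attribute of every <item> element below the root.
ids = [item.getAttribute("id") for item in root.getElementsByTagName("item")]
print(root.tagName, ids)
```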
If the XML file uses namespaces, it can be easier to disable namespace processing in the parser by passing False as the second argument.
    from xml.dom import minidom, expatbuilder

    document = expatbuilder.parse(filename, False)
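To see what disabling namespaces changes, the sketch below parses the same string both ways and compares the namespaceURI of the root element. It assumes expatbuilder.parseString takes the same second argument as expatbuilder.parse; the sample XML and the variable names are illustrative only.

```python
from xml.dom import expatbuilder

# Hypothetical sample input with a default namespace declaration.
XML_TEXT = '<root xmlns="http://example.com/ns"><item/></root>'

# Namespace processing enabled (the default behaviour).
with_ns = expatbuilder.parseString(XML_TEXT, True)

# Namespace processing disabled: xmlns is treated as an ordinary attribute.
without_ns = expatbuilder.parseString(XML_TEXT, False)

uri_with_ns = with_ns.documentElement.namespaceURI
uri_without_ns = without_ns.documentElement.namespaceURI
print(uri_with_ns, uri_without_ns)
```

With namespaces disabled, elements are created without a namespace URI, so code that matches on plain tag names does not need to account for namespaces.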
Traverse all nodes
    def recurse_print(node):
        # Text nodes carry their content in .data; everything else is descended into.
        if node.nodeType == minidom.Node.TEXT_NODE:
            print(node.data)
        else:
            for child in node.childNodes:
                recurse_print(child)

    recurse_print(document)
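The same traversal can be written as a generator, which makes the text available to the caller instead of printing it. This is a sketch of one possible variant; iter_text and the sample XML are hypothetical names introduced here.

```python
from xml.dom import minidom


def iter_text(node):
    """Yield the data of every text node below node, in document order."""
    if node.nodeType == minidom.Node.TEXT_NODE:
        yield node.data
    else:
        # Comments and other leaf nodes have no children, so they yield nothing.
        for child in node.childNodes:
            yield from iter_text(child)


# Hypothetical sample input, used only for this illustration.
document = minidom.parseString("<a>one<b>two</b><!-- skipped -->three</a>")
texts = list(iter_text(document))
print(texts)
```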
Scrape HTML tables
    def recurse_text(node):
        # Concatenate the text of all descendants of node.
        buffer = ""
        for child in node.childNodes:
            if child.nodeType == minidom.Node.TEXT_NODE:
                buffer += child.data
            else:
                buffer += recurse_text(child)
        return buffer

    data = []  # one list of cell strings per table row
    for table in document.getElementsByTagName("table"):
        for row in table.getElementsByTagName("tr"):
            data.append([])
            # Header cells are collected into the first row.
            for header_cell in row.getElementsByTagName("th"):
                data[0].append(recurse_text(header_cell))
            for cell in row.getElementsByTagName("td"):
                data[-1].append(recurse_text(cell))
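The following self-contained sketch applies the same idea to a small inline table. It assumes the markup is well-formed XML (as in XHTML), since minidom cannot parse arbitrary HTML. Unlike the snippet above, it keeps each row's header and data cells together in that row. The names cell_text, scrape_table, and SAMPLE are hypothetical.

```python
from xml.dom import minidom


def cell_text(node):
    """Return the concatenated text of all descendants of node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == minidom.Node.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(cell_text(child))
    return "".join(parts)


def scrape_table(table):
    """Return one list of stripped cell strings per <tr> in table."""
    rows = []
    for row in table.getElementsByTagName("tr"):
        # Only direct <th>/<td> children, so whitespace text nodes are skipped.
        cells = [
            child for child in row.childNodes
            if child.nodeType == minidom.Node.ELEMENT_NODE
            and child.tagName in ("th", "td")
        ]
        rows.append([cell_text(cell).strip() for cell in cells])
    return rows


# Hypothetical sample input, used only for this illustration.
SAMPLE = """<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>apple</td><td><b>3</b></td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>"""

document = minidom.parseString(SAMPLE)
rows = scrape_table(document.getElementsByTagName("table")[0])
print(rows)
```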
Scrub the DOM
It can be useful to scrub the DOM of unwanted nodes and attributes.
To remove attributes, try:
    if node.hasAttribute("hidden"):
        node.removeAttribute("hidden")
To remove nodes, try:
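    # Iterate over a copy, because removeChild modifies node.childNodes.
    for child in list(node.childNodes):
        # Only element nodes have hasAttribute; text and comment nodes do not.
        if child.nodeType == minidom.Node.ELEMENT_NODE and child.hasAttribute("hidden"):
            node.removeChild(child)
            child.unlink()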
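A recursive version of this removal, applied to a whole document, might look like the sketch below. The function name remove_hidden and the sample XML are hypothetical; the "hidden" attribute is the one used in the snippets above.

```python
from xml.dom import minidom


def remove_hidden(node):
    """Remove every descendant element of node that has a "hidden" attribute."""
    # Copy the child list first, since removeChild mutates node.childNodes.
    for child in list(node.childNodes):
        if child.nodeType != minidom.Node.ELEMENT_NODE:
            continue  # text, comments, etc. have no attributes
        if child.hasAttribute("hidden"):
            node.removeChild(child)
            child.unlink()  # release the removed subtree
        else:
            remove_hidden(child)


# Hypothetical sample input, used only for this illustration.
document = minidom.parseString(
    '<root><a hidden="">x</a><b><c hidden="">y</c>z</b></root>'
)
remove_hidden(document)
result = document.documentElement.toxml()
print(result)
```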
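To replace nodes, for example with comments, try:
    # Iterate over a copy, because replaceChild modifies node.childNodes.
    for child in list(node.childNodes):
        if child.nodeType == minidom.Node.ELEMENT_NODE and child.hasAttribute("hidden"):
            # Create a fresh node for each replacement; a node can only sit in one place.
            # Alternatively, use createTextNode or createElement.
            replacement = document.createComment("scrubbed useless node")
            # replaceChild takes the new node first, then the node being replaced.
            node.replaceChild(replacement, child)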
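As a runnable illustration of replacement, the sketch below swaps every hidden element for a comment placeholder. The function replace_hidden, the comment text, and the sample XML are hypothetical; the document is passed in explicitly because it is the factory for new nodes.

```python
from xml.dom import minidom


def replace_hidden(document, node):
    """Replace each descendant element of node that has "hidden" with a comment."""
    # Copy the child list first, since replaceChild mutates node.childNodes.
    for child in list(node.childNodes):
        if child.nodeType != minidom.Node.ELEMENT_NODE:
            continue
        if child.hasAttribute("hidden"):
            # A new comment per replacement; new node first, old node second.
            placeholder = document.createComment("scrubbed")
            node.replaceChild(placeholder, child)
            child.unlink()
        else:
            replace_hidden(document, child)


# Hypothetical sample input, used only for this illustration.
document = minidom.parseString('<root><a hidden="">x</a><b>z</b></root>')
replace_hidden(document, document)
result = document.documentElement.toxml()
print(result)
```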
See also
Python xml.dom.minidom module documentation