= Python XML DOM Minidom =

'''`xml.dom.minidom`''' is a module for parsing XML. It provides a minimal implementation of the '''DOM''' ('''D'''ocument '''O'''bject '''M'''odel), and as such offers a DOM-like API.

----

== Usage ==

=== Parse a file ===

{{{
from xml.dom import minidom

document = minidom.parse(filename)
}}}

If the XML file uses namespaces, it can be easier to disable namespace processing in the parser.

{{{
from xml.dom import minidom, expatbuilder

# The second argument is the namespaces flag; False disables namespace handling.
document = expatbuilder.parse(filename, False)
}}}

=== Traverse all nodes ===

{{{
def recurse_print(node):
    # Print text nodes; descend into everything else.
    if node.nodeType == minidom.Node.TEXT_NODE:
        print(node.data)
    else:
        for child in node.childNodes:
            recurse_print(child)

recurse_print(document)
}}}

=== Scrape HTML tables ===

Because `minidom` is an XML parser, this works only when the HTML is well-formed XML.

{{{
def recurse_text(node):
    # Concatenate the text of every descendant text node.
    buffer = ""
    for child in node.childNodes:
        if child.nodeType == minidom.Node.TEXT_NODE:
            buffer += child.data
        else:
            buffer += recurse_text(child)
    return buffer

data = []  # one list of cell strings per table row
for table in document.getElementsByTagName("table"):
    for row in table.getElementsByTagName("tr"):
        data.append([])
        # Header cells are stored in their own row, ahead of the data cells.
        for header_cell in row.getElementsByTagName("th"):
            data[-1].append(recurse_text(header_cell))
        for cell in row.getElementsByTagName("td"):
            data[-1].append(recurse_text(cell))
}}}

=== Scrub the DOM ===

It can be useful to scrub the DOM of unhelpful or useless components.

To remove attributes, try:

{{{
if node.hasAttribute("hidden"):
    node.removeAttribute("hidden")
}}}

To remove nodes, try:

{{{
# Iterate over a copy, because removeChild modifies node.childNodes in place.
for child in list(node.childNodes):
    # Only element nodes have attributes; text and comment nodes are skipped.
    if child.nodeType == minidom.Node.ELEMENT_NODE and child.hasAttribute("hidden"):
        node.removeChild(child)
        child.unlink()
}}}

To replace nodes, for example with comments, try:

{{{
for child in list(node.childNodes):
    if child.nodeType == minidom.Node.ELEMENT_NODE and child.hasAttribute("hidden"):
        # Create a fresh node for each replacement; a node can have only one parent.
        # Alternatively, use createTextNode or createElement.
        replacement = document.createComment("scrubbed useless node")
        # replaceChild takes the new child first, then the old child.
        node.replaceChild(replacement, child)
        child.unlink()
}}}

----

== See also ==

[[https://docs.python.org/3/library/xml.dom.minidom.html|Python xml.dom.minidom module documentation]]

----
CategoryRicottone
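
As a supplementary sketch, the parsing, traversal, and scrubbing steps described on this page can be combined into one self-contained script. The sample document and the helper names `collect_text` and `scrub_hidden` are assumptions made up for illustration, and `minidom.parseString` is used in place of `minidom.parse` so that no file is needed.

```python
from xml.dom import minidom

# Hypothetical sample input, invented for illustration only.
SAMPLE = (
    "<root>"
    "<item>keep me</item>"
    "<item hidden='true'>drop me</item>"
    "<item>keep me too</item>"
    "</root>"
)


def collect_text(node):
    """Return the concatenated data of all descendant text nodes."""
    if node.nodeType == minidom.Node.TEXT_NODE:
        return node.data
    return "".join(collect_text(child) for child in node.childNodes)


def scrub_hidden(node):
    """Recursively remove element children that carry a 'hidden' attribute."""
    # Iterate over a copy: removeChild mutates node.childNodes in place.
    for child in list(node.childNodes):
        if child.nodeType != minidom.Node.ELEMENT_NODE:
            continue  # only element nodes provide hasAttribute
        if child.hasAttribute("hidden"):
            node.removeChild(child)
            child.unlink()
        else:
            scrub_hidden(child)


document = minidom.parseString(SAMPLE)
before = collect_text(document)
scrub_hidden(document)
after = collect_text(document)
remaining = len(document.getElementsByTagName("item"))

print(before)
print(after)
print(remaining)
```

Copying `childNodes` with `list(...)` before the loop is the key design choice here: removing children while iterating over the live list would skip siblings.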