Python XML DOM Minidom
xml.dom.minidom is a module for parsing XML.
This parser uses a minimal implementation of the DOM (Document Object Model) and, as such, offers a DOM-like API.
Usage
Parse a file
from xml.dom import minidom
document = minidom.parse(filename)
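When the XML is already in memory, `minidom.parseString` accepts the text directly instead of a filename. A minimal sketch, assuming a made-up inventory document (the element and attribute names are illustrative only, not from any real schema):

```python
from xml.dom import minidom

# Hypothetical sample document; the names are for illustration only.
SAMPLE = '<inventory><item id="1">apple</item><item id="2">pear</item></inventory>'

document = minidom.parseString(SAMPLE)

# documentElement is the single root element of the document.
root = document.documentElement

# getElementsByTagName searches the whole subtree, in document order.
items = document.getElementsByTagName("item")
item_ids = [item.getAttribute("id") for item in items]

# The text inside an element is a separate child node of type TEXT_NODE.
first_text = items[0].firstChild.data

print(root.tagName, item_ids, first_text)
```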
If the XML file uses namespaces, it can be easier to disable that feature in the parser.
from xml.dom import minidom, expatbuilder
document = expatbuilder.parse(filename, False)
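To see what disabling namespaces changes, the sketch below parses the same string both ways. It assumes an invented `ex` prefix and namespace URI, and it relies on `xml.dom.expatbuilder`, which is an implementation module rather than a documented public API, so its behavior may differ between Python versions.

```python
from xml.dom import expatbuilder

# Hypothetical document; the prefix and URI are made up for illustration.
SAMPLE = '<ex:root xmlns:ex="http://example.com/ns"><ex:item/></ex:root>'

# Second argument is the "namespaces" flag; it defaults to True.
aware = expatbuilder.parseString(SAMPLE)
unaware = expatbuilder.parseString(SAMPLE, False)

aware_item = aware.getElementsByTagName("ex:item")[0]
unaware_item = unaware.getElementsByTagName("ex:item")[0]

# With namespaces on, the element records the URI bound to its prefix.
print(aware_item.namespaceURI)

# With namespaces off, "ex:item" is just an ordinary tag name.
print(unaware_item.tagName, unaware_item.namespaceURI)
```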
Traverse all nodes
def recurse_print(node):
    if node.nodeType == minidom.Node.TEXT_NODE:
        print(node.data)
    else:
        for child in node.childNodes:
            recurse_print(child)
recurse_print(document)
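The same traversal can return the text instead of printing it. The following is a sketch on an invented sample document; `collect_text` is a hypothetical helper name, and skipping whitespace-only text nodes is a choice made here, not something minidom does by itself.

```python
from xml.dom import minidom

# Hypothetical sample document, including whitespace between elements.
SAMPLE = "<doc><title>Hello</title>\n  <body>minidom <b>world</b></body></doc>"


def collect_text(node, found=None):
    """Depth-first walk that gathers the stripped, non-empty text nodes."""
    if found is None:
        found = []
    if node.nodeType == minidom.Node.TEXT_NODE:
        text = node.data.strip()
        if text:  # ignore whitespace-only nodes used for indentation
            found.append(text)
    else:
        for child in node.childNodes:
            collect_text(child, found)
    return found


texts = collect_text(minidom.parseString(SAMPLE))
print(texts)
```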
Scrape HTML tables
def recurse_text(node):
    # concatenate the text of a node and all of its descendants
    buffer = ""
    for child in node.childNodes:
        if child.nodeType == minidom.Node.TEXT_NODE:
            buffer += child.data
        else:
            buffer += recurse_text(child)
    return buffer
data = []  # one list per row; data[0] collects the header cells
for table in document.getElementsByTagName("table"):
    for row in table.getElementsByTagName("tr"):
        data.append([])
        for header_cell in row.getElementsByTagName("th"):
            data[0].append(recurse_text(header_cell))
        for cell in row.getElementsByTagName("td"):
            data[-1].append(recurse_text(cell))
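A self-contained variant of the table scraper is sketched below. It assumes the markup is well-formed XML (minidom cannot parse arbitrary HTML), uses an invented sample table, and treats `th` and `td` cells alike so that each row keeps its own list. The helper names `cell_text` and `scrape_table` are hypothetical.

```python
from xml.dom import minidom

# Hypothetical, well-formed sample table.
SAMPLE = (
    "<table>"
    "<tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>apple</td><td><b>3</b></td></tr>"
    "<tr><td>pear</td><td>5</td></tr>"
    "</table>"
)


def cell_text(node):
    """Concatenate the text of a node and all of its descendants."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == minidom.Node.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(cell_text(child))
    return "".join(parts)


def scrape_table(table):
    """Return one list of cell strings per <tr>, header row included."""
    rows = []
    for row in table.getElementsByTagName("tr"):
        # Only direct children, so cells of nested elements are not mixed in.
        cells = [
            child for child in row.childNodes
            if child.nodeType == minidom.Node.ELEMENT_NODE
            and child.tagName in ("th", "td")
        ]
        rows.append([cell_text(cell) for cell in cells])
    return rows


document = minidom.parseString(SAMPLE)
data = scrape_table(document.getElementsByTagName("table")[0])
print(data)
```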
Scrub the DOM
It can be useful to scrub the DOM of unhelpful or useless components.
To remove attributes, try:
if node.hasAttribute("hidden"):
    node.removeAttribute("hidden")
To remove nodes, try:
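# iterate over a copy, because removing from the live list skips nodes
for child in list(node.childNodes):
    # text and comment nodes have no hasAttribute method
    if child.nodeType == minidom.Node.ELEMENT_NODE and child.hasAttribute("hidden"):
        node.removeChild(child)
        child.unlink()
To replace nodes, as with comments, try: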
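# alternatively, createTextNode or createElement
for child in list(node.childNodes):
    if child.nodeType == minidom.Node.ELEMENT_NODE and child.hasAttribute("hidden"):
        # a node can have only one parent, so create a fresh replacement each time
        replacement = document.createComment("scrubbed useless node")
        # replaceChild takes the new node first, then the old one
        node.replaceChild(replacement, child)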
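The removal and replacement steps can be tried end to end on a small document. This is a sketch under stated assumptions: the sample XML is invented, and `is_hidden`, `remove_hidden`, and `comment_out_hidden` are hypothetical helper names, not part of minidom.

```python
from xml.dom import minidom

# Hypothetical sample document with two "hidden" elements.
SAMPLE = '<root><a hidden="hidden">x</a><b/><c hidden="hidden"/></root>'


def is_hidden(node):
    # Only element nodes have attributes; text and comment nodes do not.
    return (node.nodeType == minidom.Node.ELEMENT_NODE
            and node.hasAttribute("hidden"))


def remove_hidden(parent):
    # Copy the child list first: removeChild mutates the live list.
    for child in list(parent.childNodes):
        if is_hidden(child):
            parent.removeChild(child)
            child.unlink()


def comment_out_hidden(document, parent):
    for child in list(parent.childNodes):
        if is_hidden(child):
            # One new comment per replaced node; argument order is (new, old).
            replacement = document.createComment("scrubbed")
            parent.replaceChild(replacement, child)
            child.unlink()


removed = minidom.parseString(SAMPLE)
remove_hidden(removed.documentElement)
removed_xml = removed.documentElement.toxml()

commented = minidom.parseString(SAMPLE)
comment_out_hidden(commented, commented.documentElement)
commented_xml = commented.documentElement.toxml()

print(removed_xml)
print(commented_xml)
```

Both helpers only look at direct children; scrubbing a whole tree would mean calling them recursively on each remaining element.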
See also
Python xml.dom.minidom module documentation
