Python XML Processing

Python's xml modules enable you to parse, create, and modify XML data. The main modules include xml.etree.ElementTree (for parsing and creating XML) and xml.dom.minidom (for DOM manipulation).

1. Parsing XML with xml.etree.ElementTree

The xml.etree.ElementTree module is commonly used to parse XML from strings or files. The ElementTree.parse() method loads XML data and returns a tree structure.
import xml.etree.ElementTree as ET

# Sample XML data
xml_data = """<library>
    <book id="1">
        <title>Python Programming</title>
        <author>John Smith</author>
    </book>
    <book id="2">
        <title>Advanced Python</title>
        <author>Jane Doe</author>
    </book>
</library>"""

# Parse XML string
root = ET.fromstring(xml_data)
print("Root tag:", root.tag)

Output:

Root tag: library
Explanation: ET.fromstring() parses XML from a string and provides access to the root element.

2. Accessing Elements and Attributes

Once parsed, you can access child elements, attributes, and text content using Element objects.
# Access each book and print details
for book in root.findall("book"):
    title = book.find("title").text
    author = book.find("author").text
    book_id = book.get("id")
    print(f"Book ID: {book_id}, Title: {title}, Author: {author}")

Output:

Book ID: 1, Title: Python Programming, Author: John Smith
Book ID: 2, Title: Advanced Python, Author: Jane Doe
Explanation: findall() retrieves elements by tag name, get() retrieves attribute values, and text accesses element content.

3. Creating XML Documents

Use ElementTree to build XML documents from scratch by creating elements and setting their attributes and text content.
# Create root element
library = ET.Element("library")

# Add book elements
book1 = ET.SubElement(library, "book", id="3")
ET.SubElement(book1, "title").text = "Learning XML"
ET.SubElement(book1, "author").text = "Tom Brown"

book2 = ET.SubElement(library, "book", id="4")
ET.SubElement(book2, "title").text = "XML and Python"
ET.SubElement(book2, "author").text = "Sue Johnson"

# Generate string from XML structure
tree = ET.ElementTree(library)
ET.dump(library)  # Displays XML structure

Output:

<library>
    <book id="3">
        <title>Learning XML</title>
        <author>Tom Brown</author>
    </book>
    <book id="4">
        <title>XML and Python</title>
        <author>Sue Johnson</author>
    </book>
</library>
Explanation: ET.Element() creates a root element, ET.SubElement() adds child elements, and ET.dump() outputs the entire XML structure.

4. Writing XML to a File

To save XML data to a file, use ElementTree.write().
# Write XML to a file
tree.write("library.xml", encoding="utf-8", xml_declaration=True)
Explanation: The write() method saves XML data to a file, with optional encoding and xml_declaration arguments.

5. Pretty-Printing XML with xml.dom.minidom

The xml.dom.minidom module provides pretty-printing capabilities for XML output.
from xml.dom import minidom

# Convert ElementTree to string and pretty-print
xml_str = ET.tostring(library, encoding="unicode")
pretty_xml = minidom.parseString(xml_str).toprettyxml(indent="  ")
print("Pretty-printed XML:\n", pretty_xml)

Output:

Pretty-printed XML:
<?xml version="1.0"?>
<library>
  <book id="3">
    <title>Learning XML</title>
    <author>Tom Brown</author>
  </book>
  <book id="4">
    <title>XML and Python</title>
    <author>Sue Johnson</author>
  </book>
</library>
Explanation: minidom.parseString().toprettyxml() adds indentation to make XML human-readable.

6. Modifying XML Elements

Modify XML elements by changing their attributes or text content.
# Modify the title of the first book
book_to_modify = library.find("book")
book_to_modify.find("title").text = "Mastering XML"
print("Modified XML structure:")
ET.dump(library)

Output:

Modified XML structure:
<library>
    <book id="3">
        <title>Mastering XML</title>
        <author>Tom Brown</author>
    </book>
    <book id="4">
        <title>XML and Python</title>
        <author>Sue Johnson</author>
    </book>
</library>
Explanation: Updating an element’s text property modifies the XML data.

7. Parsing XML with xml.sax (Event-Driven)

The xml.sax module uses event-driven parsing, suitable for large XML files, as it processes elements incrementally.
import xml.sax

# Define a SAX handler to process each element
class BookHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        if name == "book":
            print(f"Book ID: {attrs['id']}")
    
    def characters(self, content):
        if content.strip():
            print("Content:", content)
    
    def endElement(self, name):
        if name == "book":
            print("--- End of book ---")

# Parse XML with SAX
xml.sax.parseString(xml_data, BookHandler())

Output:

Book ID: 1
Content: Python Programming
Content: John Smith
--- End of book ---
Book ID: 2
Content: Advanced Python
Content: Jane Doe
--- End of book ---
Explanation: The BookHandler class processes XML elements as they’re encountered, making it efficient for large files.

Summary

Python’s xml.etree.ElementTree, xml.dom.minidom, and xml.sax modules offer versatile options for parsing, creating, modifying, and saving XML data, catering to various processing needs.

Previous: Python Regular Expressions | Next: Python Math

<
>