Comprehensive processing extension for web markup languages (XML and HTML) that enables SQL-native analysis of structured documents with intelligent schema inference, XPath-based data extraction, and powerful HTML table parsing capabilities.
Installing and Loading
INSTALL webbed FROM community;
LOAD webbed;
Example
-- Load the extension
LOAD webbed;
-- Read XML files directly into tables
SELECT * FROM 'data.xml';
SELECT * FROM read_xml('config/*.xml');
-- Parse and extract from XML content using XPath
SELECT xml_extract_text('<book><title>Database Guide</title></book>', '//title');
-- Result: "Database Guide"
-- Parse and extract from HTML content
SELECT html_extract_text('<html><body><h1>Welcome</h1></body></html>', '//h1');
-- Result: "Welcome"
-- Extract HTML tables directly into DuckDB
SELECT * FROM html_extract_tables('<table><tr><th>Name</th><th>Age</th></tr><tr><td>John</td><td>25</td></tr></table>');
-- Extract links and images from HTML pages
SELECT html_extract_links('<a href="https://example.com">Click here</a>');
SELECT html_extract_images('<img src="photo.jpg" alt="Photo" width="800">');
-- Convert between XML and JSON formats
SELECT xml_to_json('<person><name>John</name><age>30</age></person>');
SELECT json_to_xml('{"name":"John","age":"30"}');
About webbed
DuckDB XML is a comprehensive extension that brings powerful XML and HTML processing capabilities to DuckDB, enabling SQL-native analysis of structured documents. The extension provides three core areas of functionality:
XML Processing & Analysis: Parse, validate, and extract data from XML documents using full XPath 1.0 expressions. Functions include xml_extract_text(), xml_extract_elements(), xml_extract_attributes(), xml_valid(), and xml_stats() for comprehensive document analysis. The extension handles namespaces, comments, CDATA sections, and provides utilities like xml_pretty_print() and xml_minify().
HTML Processing & Web Scraping: Advanced HTML parsing capabilities with specialized functions for web data extraction. Extract text content with html_extract_text(), parse HTML tables into structured data with html_extract_tables(), extract links with metadata using html_extract_links(), and extract images with attributes using html_extract_images(). Perfect for web scraping and HTML document analysis workflows.
Smart Schema Inference & File Reading: Automatically flatten XML/HTML documents into relational tables with intelligent type detection for dates, numbers, booleans, and nested structures. Functions like read_xml() and read_html() provide direct file-to-table conversion with configurable options for error handling, maximum file sizes, and schema customization.
Key XML Functions:
read_xml(pattern)- Read XML files with automatic schema inferencexml_extract_text(xml, xpath)- XPath-based text extractionxml_extract_elements(xml, xpath)- Extract structured elementsxml_extract_attributes(xml, xpath)- Extract attributes as structsxml_to_json(xml)/json_to_xml(json)- Format conversionsxml_stats(xml)- Document statistics and analysisxml_validate_schema(xml, xsd)- XSD schema validation
Key HTML Functions:
read_html(pattern)- Read HTML files into tableshtml_extract_tables(html)- Extract HTML tables as structured data (this will often require post-processing given the possible complexity of HTML tables)html_extract_links(html)- Extract all links with metadatahtml_extract_images(html)- Extract images with attributeshtml_extract_text(html, xpath)- XPath-based HTML text extractionparse_html(content)- Parse HTML strings into structured format
See: https://github.com/teaguesterling/duckdb_webbed/blob/main/README.md for more examples and details.
Built on libxml2 for robust, standards-compliant parsing with comprehensive error handling, memory-safe RAII implementation, and 100% test coverage. The extension supports mixed file systems, configurable schema inference, and efficient processing of large document collections.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| html_extract_images | scalar | NULL | NULL | |
| html_extract_links | scalar | NULL | NULL | |
| html_extract_table_rows | scalar | NULL | NULL | |
| html_extract_tables | table | NULL | NULL | |
| html_extract_tables_json | scalar | NULL | NULL | |
| html_extract_text | scalar | NULL | NULL | |
| json_to_xml | scalar | NULL | NULL | |
| parse_html | scalar | NULL | NULL | |
| read_html | table | NULL | NULL | |
| read_html_objects | table | NULL | NULL | |
| read_xml | table | NULL | NULL | |
| read_xml_objects | table | NULL | NULL | |
| to_xml | scalar | NULL | NULL | |
| xml | scalar | NULL | NULL | |
| xml_extract_all_text | scalar | NULL | NULL | |
| xml_extract_attributes | scalar | NULL | NULL | |
| xml_extract_cdata | scalar | NULL | NULL | |
| xml_extract_comments | scalar | NULL | NULL | |
| xml_extract_elements | scalar | NULL | NULL | |
| xml_extract_elements_string | scalar | NULL | NULL | |
| xml_extract_text | scalar | NULL | NULL | |
| xml_libxml2_version | scalar | NULL | NULL | |
| xml_minify | scalar | NULL | NULL | |
| xml_namespaces | scalar | NULL | NULL | |
| xml_pretty_print | scalar | NULL | NULL | |
| xml_stats | scalar | NULL | NULL | |
| xml_to_json | scalar | NULL | NULL | |
| xml_valid | scalar | NULL | NULL | |
| xml_validate_schema | scalar | NULL | NULL | |
| xml_well_formed | scalar | NULL | NULL | |
| xml_wrap_fragment | scalar | NULL | NULL |
Added Types
| type_name | type_size | logical_type | type_category | internal |
|---|---|---|---|---|
| HTML | 16 | VARCHAR | STRING | true |
| XML | 16 | VARCHAR | STRING | true |
| XMLFragment | 16 | VARCHAR | STRING | true |