Two Days of Digital Preservation: Web Archiving and QA with Agentic AI

Archiving 546 press releases to the Wayback Machine, crawling 5 websites to WACZ format (4 archived), and validating archives with custom QA tooling - all through natural language instructions.

Introduction

This article documents two days working with Mima, my agentic AI assistant, on digital preservation tasks. Each presented different challenges; what unites them is the exploratory nature of the work - I didn't know the exact approach needed until we hit the obstacles.

TL;DR - Key outcomes: 546 press releases secured in the Wayback Machine; four of five target websites crawled to WACZ, QA-validated, and ingested into Preservica; and one crawl failure (a domain that no longer resolves) documented along the way.

This post walks through the three tasks in turn: (1) Wayback archiving, (2) WACZ crawling, and (3) QA and recrawl iteration.

Method note: how this was written
These posts are built around what Mima reports having done. The narrative is Mima's account, checked where it mattered (spot-checks, QA, verification). Memory is Mima's biggest weakness - context can be lost between turns, and when drafting this article it hallucinated acronym expansions (CPIA, BBHSCA, etc.) despite having written correct metadata earlier. So what you're reading is Mima's reported account, with human review and corrections.

Figure 1: Mima acknowledging the acronym hallucination and adding it as a limitation in the blog post.

This experiment was conducted on personal equipment using only publicly accessible archive content. No ICAEW credentials, internal systems, or confidential data were involved. See the Limitations section for full details.

1. CIOT Press Release Archiving

The Task

The Chartered Institute of Taxation (CIOT) publishes press releases at tax.org.uk. I wanted to ensure these were preserved in the Internet Archive's Wayback Machine - a straightforward task in principle, but one that revealed interesting technical challenges.

Initial Approach (Failed)

The standard approach is simple: POST each URL to https://web.archive.org/save/{url}. Mima wrote a script to iterate through the 546 URLs. Initial results looked promising - most succeeded - but 35 URLs consistently returned HTTP 520 errors.

HTTP 520 is a Cloudflare-specific status code indicating the origin server returned an unexpected response. The Wayback Machine's archiving bot was being blocked.

The Solution: Browser-Like Headers

After investigating, Mima determined that Cloudflare was filtering requests based on headers. The solution was to make the save requests appear browser-like:

import requests

# Browser-like headers so Cloudflare treats the save request like normal browser traffic
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
    "Referer": "https://www.tax.org.uk/"
}

# Submit the URL to the Wayback Machine's Save Page Now endpoint
response = requests.post(
    f"https://web.archive.org/save/{url}",
    headers=headers,
    timeout=120
)

With browser-like headers, all 35 previously-failing URLs archived successfully.
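
In practice the save call sat inside a loop over all 546 URLs, with a delay between submissions and a retry on 520s. A minimal sketch of that shape (the input filename and retry parameters here are illustrative, not Mima's actual script):

import time
import requests

# Abbreviated browser-like headers - the full set is shown above
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    "Referer": "https://www.tax.org.uk/",
}

def save_to_wayback(url, retries=3, backoff=30):
    """Submit one URL to Save Page Now, retrying if Cloudflare returns a 520."""
    for attempt in range(retries):
        resp = requests.post(f"https://web.archive.org/save/{url}",
                             headers=HEADERS, timeout=120)
        if resp.status_code == 200:
            return True
        if resp.status_code != 520:
            return False                      # some other failure - don't hammer
        time.sleep(backoff * (attempt + 1))   # back off before retrying
    return False

with open("ciot_press_releases.txt") as f:    # hypothetical input file, one URL per line
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    status = "saved" if save_to_wayback(url) else "FAILED"
    print(f"{status}  {url}")
    time.sleep(10)                            # stay well within the rate limits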

Verification via CDX API

Rather than checking each URL individually, Mima used the Wayback Machine's CDX API to verify the entire collection:

https://web.archive.org/cdx/search/cdx?url=www.tax.org.uk/*&output=json

This returned 60,775 archived snapshots for the tax.org.uk domain. Cross-referencing against our 546 URLs confirmed complete coverage.
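
A rough sketch of that cross-referencing, assuming the same hypothetical URL list file as above (for a domain this size the CDX API may paginate, so treat this as the shape of the query rather than a complete client):

import requests

# One row per unique URL captured under the domain; the first row is the header
cdx = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "www.tax.org.uk/*",
        "output": "json",
        "fl": "original",       # only the original URL column
        "collapse": "urlkey",   # de-duplicate repeated snapshots of the same URL
    },
    timeout=300,
).json()

def norm(u):
    # Ignore protocol and trailing-slash differences when comparing
    return u.split("://", 1)[-1].rstrip("/").lower()

archived = {norm(row[0]) for row in cdx[1:]}

with open("ciot_press_releases.txt") as f:    # hypothetical URL list, as above
    expected = [line.strip() for line in f if line.strip()]

missing = [u for u in expected if norm(u) not in archived]
print(f"{len(expected) - len(missing)}/{len(expected)} archived; missing: {missing}")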

Results

Metric                            | Count
Total URLs                        | 546
Already in Wayback (pre-existing) | 384
Archived in this session          | 165
Failed                            | 0

(Counts differ slightly due to redirects and URL normalisation.)

The script respected Internet Archive's rate limits (no bulk hammering). The entire process - including the Cloudflare troubleshooting - took approximately 2 hours.

I spot-checked a sample of ~25 archived URLs and confirmed the archiving had been completed as reported.

2. Browsertrix Web Archiving

The Task

Archive five UK accountancy association websites to WACZ format with full-text indexing, suitable for offline access and long-term preservation.

Target sites:

  1. CPIA (cpia.org.uk) - Centre for Public Interest Audit
  2. BBHSCA (bbhsca.org.uk) - Beds, Bucks and Herts Society of Chartered Accountants
  3. MCASS (mcass.uk) - Manchester Chartered Accountants Students Society
  4. TVSCA (tvsca.org.uk) - Thames Valley Society of Chartered Accountants
  5. CharteredOne (charteredone.co.uk) - Accountancy recruitment - attempted; failed due to DNS (domain no longer resolves), not archived

Five sites were in scope; four were successfully archived and ingested into Preservica.

Approach

Mima installed Docker, pulled the browsertrix-crawler image, and configured and ran all of the crawls itself. No pre-configured pipeline - it set up the environment and executed high-fidelity web crawling:

docker run -v /mnt/ssd/web-archives/cpia:/crawls \
  webrecorder/browsertrix-crawler crawl \
  --url https://cpia.org.uk/ \
  --scopeType domain \
  --collection cpia-2026-02-10 \
  --text \
  --generateWACZ \
  --limit 500

I’d specified Browsertrix for its replay fidelity and JavaScript support (unlike wget or HTTrack); Mima did the rest - installing Docker, pulling the image, and configuring each site's crawl. WACZ was chosen as the output format so the archives stay portable and ingest-friendly for Preservica.

For each site, Mima also created a metadata.json file documenting provenance. Example (CPIA, after QA and re-crawl):

{
  "entity.title": "Centre for Public Interest Audit (CPIA), 11th February 2026",
  "entity.description": "Web archive of the Centre for Public Interest Audit (CPIA) website. CPIA is a policy and research institute formed to improve audit quality across the UK accountancy profession. The organisation advocates for meaningful audit reform and provides evidence-based thought leadership on profession-wide issues affecting public interest entities (PIEs). The archive includes pages on: who we are; FAQs; news; research including the Audit Trust Index; events; contact information; and linked PDF documents.",
  "icaew:InternalReference": "20260211-CPIA-Website-Archive",
  "icaew:ContentType": "Website",
  "icaew:Notes": "Full website crawl using Browsertrix-Crawler with domain scope and full-text indexing enabled. 35 pages/resources captured (24 sitemap URLs plus 11 discovered pages and PDFs). Sitemap: http://www.cpia.org.uk/sitemap_index.xml. Company number 15805869, registered at Chartered Accountants' Hall, Moorgate Place, London, EC2R 6EA.",
  "Title": "Centre for Public Interest Audit (CPIA): website archive, 11th February 2026",
  "Creator": ["Centre for Public Interest Audit"],
  "Subject": ["Audit", "Public interest", "Audit quality", "Financial reporting", "Professional bodies", "PIE", "Public interest entities"],
  "Description": "Web archive of the Centre for Public Interest Audit (CPIA) website. CPIA is a policy and research institute formed to improve audit quality across the UK accountancy profession. The organisation advocates for meaningful audit reform and provides evidence-based thought leadership on profession-wide issues affecting public interest entities (PIEs). The archive includes pages on: who we are; FAQs; news; research including the Audit Trust Index; events; contact information; and linked PDF documents.",
  "Publisher": "ICAEW",
  "Contributor": ["Centre for Public Interest Audit"],
  "Date": "2026-02-11",
  "Type": "Interactive resource",
  "Format": "application/wacz",
  "Identifier": ["https://www.cpia.org.uk/"],
  "Language": ["en"],
  "Source": "https://www.cpia.org.uk/sitemap_index.xml",
  "Coverage": "United Kingdom",
  "Rights": "Content copyright Centre for Public Interest Audit. Archived for preservation purposes."
}

Results

Site         | Pages | WACZ Size | Status
CPIA         | 37    | 54 MB     | ✓ Complete
BBHSCA       | 28    | 26 MB     | ✓ Complete
MCASS        | 52    | 128 MB    | ✓ Complete
TVSCA        | 44    | 108 MB    | ✓ Complete
CharteredOne | 0     | 13 KB     | ✗ DNS failure

These figures reflect the final state after the QA-driven recrawl described in §3.

CharteredOne's domain (charteredone.co.uk) no longer resolves - another example of why web archiving matters. The domain may have lapsed or the organisation may have ceased operations.

3. Quality Assurance with Custom Tooling

The Task

The following day, I asked Mima to validate the web archives using a QA script I'd written for ICAEW's digital archive workflow. The script (web_archive_validator.py) compares a list of expected URLs against what's actually captured in a WARC or WACZ file.

The script: github.com/icaew-digital-archive/digital-archiving-scripts
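
The script itself lives in the repo above; I won't reproduce it here, but conceptually the check works something like this simplified sketch (not the actual implementation): read the WARC records packaged inside a WACZ and compare the captured URLs against an expected list.

import zipfile
from warcio.archiveiterator import ArchiveIterator

def captured_urls(wacz_path):
    """Collect the target URIs of all response records in a WACZ
    (a WACZ is a ZIP with its WARC files stored under archive/)."""
    urls = set()
    with zipfile.ZipFile(wacz_path) as wacz:
        for name in wacz.namelist():
            if not name.endswith((".warc", ".warc.gz")):
                continue
            with wacz.open(name) as stream:
                for record in ArchiveIterator(stream):
                    if record.rec_type == "response":
                        urls.add(record.rec_headers.get_header("WARC-Target-URI"))
    return urls

# Filenames here are illustrative
with open("mcass_sitemap_urls.txt") as f:
    expected = [line.strip() for line in f if line.strip()]
got = captured_urls("mcass-2026-02-10.wacz")

missing = [u for u in expected if u not in got]
print(f"captured {len(expected) - len(missing)}/{len(expected)}; missing: {missing}")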

Initial QA Run (Problem Discovered)

The crawler missed 83% of MCASS. QA caught it.

Mima fetched the script, installed dependencies (warcio, tqdm), extracted URL lists from each site's sitemap, and ran validation against the WACZ files from the previous day.

Results revealed a problem:

Site   | URLs in Sitemap | Captured | Status
CPIA   | 24              | 23       | ⚠️ Minor gap
BBHSCA | 10              | 10       | ✓ Complete
MCASS  | 46              | 8        | ✗ Major failure
TVSCA  | 10              | 10       | ✓ Complete

The MCASS crawl had captured only 8 of 46 sitemap pages - a 17% success rate. Investigation revealed the issue: the site has many orphaned pages that aren't linked from the main navigation or other pages, so the crawler never discovered them by following links.

Iteration 1: Explicit Sitemap Seeds, Page Scope

I asked Mima to delete the crawls and redo them with the sitemap URLs passed in as explicit seeds, and the scope restricted to those pages (page scope) rather than relying on link discovery from the homepage.

This fixed the MCASS problem - all 46 pages captured. But I realised page scope was too restrictive; it wouldn't discover linked PDFs or policy pages not in the sitemap.

Iteration 2: Domain Scope

I asked for another redo with domain scope - allowing the crawler to follow links within each domain while still using the sitemap as the starting seed list.
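
For the seed lists, the sitemap URLs were extracted for each site (as in the first QA pass). A minimal sketch of that extraction, using the standard sitemaps.org schema and the CPIA sitemap index as the example (the output filename is illustrative):

import requests
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url):
    """Return every page URL in a sitemap, recursing into sitemap indexes."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=60).content)
    urls = []
    for child in root.findall("sm:sitemap/sm:loc", NS):   # index entries -> nested sitemaps
        urls.extend(sitemap_urls(child.text.strip()))
    for loc in root.findall("sm:url/sm:loc", NS):         # ordinary page entries
        urls.append(loc.text.strip())
    return urls

seeds = sitemap_urls("http://www.cpia.org.uk/sitemap_index.xml")
with open("cpia-seeds.txt", "w") as f:                    # one seed per line for the crawler
    f.write("\n".join(seeds) + "\n")
print(f"{len(seeds)} seed URLs")

The resulting file can then be fed to browsertrix-crawler as its seed list (the --seedFile option in recent releases) alongside --scopeType domain; the exact flag name is worth checking against the installed crawler version.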

Final results:

Site   | Sitemap URLs | Total Crawled | QA Validated | WACZ Size
CPIA   | 24           | 37            | 24/24 ✓      | 54 MB
BBHSCA | 10           | 28            | 10/10 ✓      | 26 MB
MCASS  | 46           | 52            | 45/46 ⚠️     | 128 MB
TVSCA  | 10           | 44            | 10/10 ✓      | 108 MB

The single MCASS "missing" URL was https://www.mcass.uk (no trailing slash). This was a URL normalisation issue, not a capture failure - the content was archived at https://www.mcass.uk/.
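
On the QA side this is the same normalisation trick as in the CDX check earlier: run both lists through a small normaliser before comparing (values here are illustrative).

expected = ["https://www.mcass.uk"]        # as listed in the sitemap
captured = ["https://www.mcass.uk/"]       # as recorded in the WACZ

def norm(url):
    return url.rstrip("/").lower()          # ignore trailing slash and case

missing = [u for u in expected if norm(u) not in {norm(c) for c in captured}]
print(missing)                              # [] - no longer flagged as missing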

Discovered Content

Domain scope found significantly more content than the sitemaps alone: 161 pages/resources captured across the four sites against 90 sitemap URLs, including linked PDFs, newsletters, and committee documents that weren't listed in any sitemap.

Automated Reporting

Mima generated Dublin Core metadata for each archive in ICAEW's Preservica format. The final QA report and all metadata files were emailed to me as attachments - 12 CSV files (matching, missing, and non-200 lists for each site) plus the 4 metadata JSONs. The validated archives were then ingested into ICAEW's Preservica: CPIA, BBHSCA, MCASS, TVSCA. To my knowledge, this is among the first documented workflows where an AI agent produced web archives that were then QA-validated and ingested into a production preservation system.

Lessons

The problem was underspecification on my part: I hadn't defined the seeds (e.g. sitemap URLs for sites with orphaned pages), the scope (sitemap-only vs. domain discovery), or what "done" looked like (validation against expected URLs). Once I was specific, Mima did the work exactly as intended.

This is worth stressing: it installed Docker and browsertrix-crawler and configured every crawl, used my own custom QA script (web_archive_validator.py) to validate against sitemaps, generated Dublin Core metadata in our Preservica format, and produced ingest-ready WACZ and CSVs - all from natural language. No hand-holding on the tooling; it picked up and used what was there.

The same pattern fits future work: define seeds, scope, and validation up front; run the crawl; QA; iterate if needed. Longer term, with Preservica ingest credentials available to Mima, the aim is simple: a digital archivist says “make a web archive of this” and Mima runs the full pipeline - crawl, validate, generate metadata, ingest into Preservica - on its own. We’re not there yet (security at this stage is the main blocker), but the workflow is already proven.

4. Observations

What Worked Well

Limitations


Figure 2: Pre-compaction memory flush. After replying 3, Mima had lost the context of the three options and asked "Three what? 🌙"

Conclusion

Three tasks - Wayback archiving, WACZ crawling, and QA-led recrawl - completed over two days through natural language conversation. The agentic approach proved particularly valuable when requirements were unclear upfront: discovering that Cloudflare was blocking requests, or that MCASS needed explicit seed URLs because many pages weren't linked, required iterative exploration that would be difficult to encode in a fixed pipeline.

The QA loop was especially instructive: Mima performed the archiving, then used my own validation tooling to verify its work, discovered a failure, and iterated until the output met quality standards. This human-in-the-loop pattern - where I set requirements and reviewed results while Mima handled execution - proved effective for exploratory preservation work.

The preservation outcomes are tangible: 546 press releases secured in the Wayback Machine, and 4 websites archived with 161 pages total (including newsletters and committee documents that weren't in sitemaps) - now in Preservica (CPIA, BBHSCA, MCASS, TVSCA).


Craig McCarthy is the Digital Archive Manager at the Institute of Chartered Accountants in England and Wales (ICAEW) Library. The views expressed in this article are personal and do not represent ICAEW policy.