This article explores how the ICAEW Digital Archive has experimented with and adopted AI-powered metadata extraction workflows, and how we have transitioned from manual metadata creation to AI-assisted processes.
When the ICAEW Digital Archive was established in early 2020, we made a pragmatic decision to ingest assets into Preservica (our chosen digital preservation system) with minimal or, in some cases, no metadata. This decision was driven by limited staff resources; we had also assumed users would navigate to our assets via links from the library catalogue, so we initially focused on preservation over description. Over time, however, several problems with this approach became clear.
We gradually began applying more detailed metadata to ingested assets, but without the strict metadata guidelines that, in retrospect, we needed. This led to inconsistent metadata, both between staff members and over time as styles drifted. Some assets also arrived without any metadata at all, notably our AV assets (webinars and similar recordings); we simply did not have time to create this metadata ourselves, especially video descriptions, which would have required watching and reviewing each recording.
All of this led to a two-fold problem:
- some assets carried inconsistent or inaccurate metadata
- other assets carried no metadata at all
In both cases, assets were effectively undiscoverable.
Reviewing and correcting the repository's metadata became necessary, but a completely manual audit was impractical due to the volume of assets. We wondered if artificial intelligence (AI), specifically large language models (LLMs), could help make this task more feasible. Our hope was that AI could not only fill in basic metadata (titles, dates, etc.) but also generate rich asset-level descriptions that were far too time-consuming to write manually, both for text-based assets and our AV assets.
We took confidence from earlier successes incorporating AI tools into our workflow. For example, we had used WhisperX (an enhanced speech-to-text model) to transcribe our AV assets before ingest. This success encouraged us to explore LLMs for metadata extraction.
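As an illustration, here is a minimal sketch of the kind of WhisperX transcription call we run before ingest (the model size, device, and file names are illustrative choices rather than a record of our exact configuration):

```python
import whisperx

device = "cuda"  # or "cpu" on machines without a GPU
model = whisperx.load_model("large-v2", device, compute_type="float16")

# load_audio uses ffmpeg, so video containers such as .mp4 work directly
audio = whisperx.load_audio("webinar.mp4")
result = model.transcribe(audio, batch_size=16)

# Align the segments for more accurate timestamps before writing out
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

with open("webinar_transcript.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(seg["text"].strip() for seg in result["segments"]))
```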
Initial research looked promising: Alyafeai et al. (2025) indicated that, with careful prompt design, LLMs can extract bibliographic fields from complex documents, and Busch et al. (2025) noted that this metadata can approach the quality of human-created records.
Multiple sources (e.g., Huyen, 2024; Nair et al., 2025) emphasised that AI output still needs human oversight, i.e. a human-in-the-loop, and should complement rather than replace professional judgment.
In short, we view AI as a powerful assistive tool (i.e. a “first-pass” cataloguer) for tackling our metadata backlog, and for future ingests.
We knew fundamentally what needed to be built: a system that would provide individual assets (or their most informative portions) to an LLM, along with highly detailed instructions on how to extract metadata in a specific style, and return a structured response.
The high-level design can be visualised as follows:
Figure 1: High-level system architecture for AI-powered metadata extraction.

Figure 2: Demonstration of the AI-powered metadata extraction workflow in action, showing the system processing PDF documents and generating structured metadata output.
Our detailed prompt includes:
- a task description and an assigned persona
- field-by-field cataloguing rules for every Dublin Core and ICAEW-specific field
- our authorised subject taxonomy, with hierarchy rules
- the required JSON output format, with worked examples
- supporting context and self-check instructions
Each of these components is described in detail below.
The JSON metadata output contains complete Dublin Core fields, ICAEW-specific fields, and hierarchical subjects, which we transform into CSV for easy review and editing.
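A minimal sketch of that JSON-to-CSV flattening step (the field names follow the example output later in this article; the file paths and ID scheme are illustrative):

```python
import csv
import json
from pathlib import Path

FIELDS = ["assetId", "Title", "Creator", "Description", "Date", "Subject",
          "Publisher", "Type", "Format", "Identifier", "Language"]

def to_row(asset_id: str, record: dict) -> dict:
    """Flatten one asset's JSON metadata into a single CSV row."""
    row = {"assetId": asset_id}
    for field in FIELDS[1:]:
        value = record.get(field, "")
        # Multi-valued fields are joined so each asset stays on one row
        row[field] = "; ".join(value) if isinstance(value, list) else value
    return row

with open("metadata_review.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    # One JSON file per asset, named by its Preservica asset ID
    for path in sorted(Path("llm_output").glob("*.json")):
        writer.writerow(to_row(path.stem, json.loads(path.read_text())))
```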
Our development approach focused on rapid prototyping and iterative refinement. We began with simple experiments using OpenAI's GPT models, progressively expanding the prompt with more fields and rules while testing on real assets. This tight feedback loop – build, test, adjust – allowed us to quickly discover what the LLM got right or wrong and improve our instructions accordingly. We used Cursor IDE during development, which significantly sped up our prototyping and allowed us to evolve the workflow organically based on what worked in practice.
The true linchpin of our workflow is the comprehensive prompt, which is essentially a detailed cataloguing manual. A major influence on our methodology was Chip Huyen's book AI Engineering: Building Applications with Foundation Models (2024), particularly the chapter on prompt engineering. We followed Huyen's framework for clear prompting:
Core Prompt Components:
Our prompt begins with: "Your task is to analyse uploaded assets and extract structured metadata following ICAEW-specific conventions based on the Dublin Core schema and internal rules". We provide unambiguous instructions for each metadata field rather than vague guidance. When the model is uncertain, we instruct it to leave fields blank or mark them as "N/A" rather than guessing.
We assigned the AI a specific persona: "You are a metadata archivist for the ICAEW Digital Archive", which helps the model understand the professional standards and perspective required for generating appropriate responses.
We list every metadata field we expect: Title, Creator, Description, Date, Subject, Publisher, Format, Identifier, Language, Type, Relation, etc., along with guidelines for each. For example, for Title we specify the exact format and rules that we expect:
Title (REQUIRED)
- Single value only
- Use the title as it appears in the document
- Use sentence case (capitalize first word only), but acronyms (e.g., OECD, IFRS, FRC, HMRC, UK, VAT) must always be in all capitals, even at the start of the title or after a colon.
- ALWAYS use colons (:) to separate title and subtitle - replace any em-dashes (—), en-dashes (–), or hyphens (-) used as separators with colons
- Do not capitalize the first letter after a colon
- Do not use "&"; use "and"
- Use question marks if applicable, but do not end with full stops
- The order and format should be: "title: subtitle, issue/volume, date"
- If a title doesn't follow this format, reorganize it accordingly. For example:
* "UK business confidence monitor report: Q4 2012 Scotland" should become "UK business confidence monitor report: Scotland, Q4 2012"
- Use readable date formatting in titles, e.g., "15th January 2024" (not "2024-01-15" or "January 15, 2024")
- Indicate if the content is revised or time-limited
- Use only standard ASCII characters - avoid Unicode characters, smart quotes, or special symbols
- Examples:
* "OECD discussion draft on the application of tax treaties to state-owned entities: including sovereign wealth funds, TAXREP 4/10, 22nd January 2010"
* "IFRS 16 leases"
* "Audit firm governance: a project for the Financial Reporting Council, Ernst and Young LLP response, 3rd February 2009"
* "Audit firm governance: evidence gathering consultation paper, 5th February 2009"
* "Technical release: IFRS 9 implementation, TECH 01/24, 15th January 2024"
Every field has similar detailed instructions.
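Some of these rules are mechanical enough to spot-check in code during review. The sketch below is a hypothetical review aid rather than part of our extraction pipeline:

```python
import re

def title_warnings(title: str) -> list[str]:
    """Flag violations of a few mechanical title rules for human review."""
    warnings = []
    if re.search(r"\s[–—-]\s", title):
        warnings.append("dash used as a separator; replace with a colon")
    if "&" in title:
        warnings.append('contains "&"; use "and"')
    if title.rstrip().endswith("."):
        warnings.append("ends with a full stop")
    if not title.isascii():
        warnings.append("contains non-ASCII characters")
    return warnings

# Flags the dash separator, the trailing full stop, and the non-ASCII en-dash
print(title_warnings("Audit firm governance – a project for the FRC."))
```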
We provide the model with our full list of authorised subject terms. For example, the prompt includes: "ICAEW Subject Taxonomy: { Accounting; Auditing; Taxation; … }" indicating which terms are top-level and which are narrower terms.
We instruct the AI to use these terms when assigning the Subject field, and to include the broader terms as well if a narrower term is chosen (ensuring hierarchy). For instance, if the asset is about audit quality and the term "Audit best practice" is used, the AI should also include the broader term "Audit and assurance" as a parent subject.
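The hierarchy rule itself is straightforward to express in code; in the following sketch, the parent map is a tiny illustrative slice of the taxonomy, not the real thing:

```python
# Map each narrower term to its broader (parent) term; top-level terms map to None
PARENTS = {
    "Audit best practice": "Audit and assurance",
    "Audit and assurance": None,
    "Artificial intelligence": "Information technology",
    "Information technology": None,
}

def expand_subjects(terms: list[str]) -> list[str]:
    """Return the chosen terms plus all their broader ancestors, deduplicated."""
    expanded: list[str] = []
    for term in terms:
        while term is not None and term not in expanded:
            expanded.append(term)
            term = PARENTS.get(term)
    return expanded

print(expand_subjects(["Audit best practice"]))
# ['Audit best practice', 'Audit and assurance']
```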
The AI's ability to apply our taxonomy has been impressive: it often selects terms that closely match what our internal rules-based system would select. This mirrors the success of using an LLM to apply a hierarchical taxonomy reported by Song et al. (2024).
We explicitly require JSON output with specific keys and provide detailed examples. This structured approach ensures consistency and makes the output easily parseable for our CSV conversion pipeline.
We provide concrete examples to demonstrate the expected output format and style. This helps the model understand our specific requirements and reduces ambiguity in the generated responses. For instance:
```json
{
"entity.title": "Commercial insight: expanding the CFO's horizons, September 2020",
"entity.description": "Quarterly special report from the Business and Management Faculty featuring articles and insights on commercial leadership for CFOs and FDs. Contents include: Wanted urgently - the T-shaped finance director; UK CFO insight - no quick bounce back in the next year; Learning how to acquire a broader perspective; Does being a commercial FD just mean saying 'yes' to your CEO?; More pictures, fewer numbers - the CFO's agenda today; Global CFOs see need for agile planning in the downturn; Why collaboration between marketing and finance is essential; Lessons of COVID-19: building a resilient finance function; Recruiters step up search for FDs with commercial acumen; UK CEOs given a 'licence to change'; Employee engagement during COVID-19. (AI generated description)",
"icaew:ContentType": "Report",
"icaew:InternalReference": "20200900-Commercial-Insight-Expanding-The-CFOs-Horizons-Business-And-Management-Faculty-METCAH20201",
"icaew:Notes": "",
"Title": "Commercial insight: expanding the CFO's horizons, September 2020",
"Creator": ["Business and Management Faculty", "ICAEW"],
"Subject": [],
"Description": "Quarterly special report from the Business and Management Faculty featuring articles and insights on commercial leadership for CFOs and FDs. Contents include: Wanted urgently - the T-shaped finance director; UK CFO insight - no quick bounce back in the next year; Learning how to acquire a broader perspective; Does being a commercial FD just mean saying 'yes' to your CEO?; More pictures, fewer numbers - the CFO's agenda today; Global CFOs see need for agile planning in the downturn; Why collaboration between marketing and finance is essential; Lessons of COVID-19: building a resilient finance function; Recruiters step up search for FDs with commercial acumen; UK CEOs given a 'licence to change'; Employee engagement during COVID-19. (AI generated description)",
"Publisher": "Silverdart Publishing",
"Contributor": "",
"Date": "2020-09",
"Type": "Text",
"Format": "pdf",
"Identifier": ["ISBN 978-1-78363-953-3", "METCAH20201"],
"Source": "",
"Language": ["en"],
"Relation": ["Special Report"],
"Coverage": "",
"Rights": ""
}
```
We provide comprehensive context including Dublin Core schema definitions, ICAEW-specific requirements, and our controlled vocabularies. This context helps prevent hallucination and ensures the model relies on our specifications rather than its internal knowledge. In addition, we supply at least 10 PDF pages from our assets to ensure that the model has enough context to work with when generating asset-level descriptions.
The prompt includes rules, such as: "If you are unsure or the information is not present, leave the field blank or say 'N/A'". We also added a section where the AI is asked to perform a brief self-check of its output (for example, ensuring required fields like Title and Date are not empty, and that subject terms are from the given list).
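The same invariants can also be re-checked programmatically before human review, as a belt-and-braces measure. The sketch below is illustrative rather than a description of our exact code; the field names follow the example output above:

```python
REQUIRED_FIELDS = ("Title", "Date")

def validate_record(record: dict, authorised_terms: set[str]) -> list[str]:
    """Return a list of problems found; an empty list means the record passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field) or record[field] == "N/A":
            problems.append(f"required field {field!r} is empty")
    for term in record.get("Subject", []):
        if term not in authorised_terms:
            problems.append(f"subject {term!r} is not an authorised term")
    return problems
```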
For a complete understanding of our prompt, see the full prompt configuration in our GitHub repository.
Several pragmatic design choices helped us streamline the project:
PDF as a standard input format: We decided to normalise all assets to PDF format before processing. This simplified our pipeline and ensured the AI model consistently received both text and layout information. We initially tried extracting plain text from assets (using tools like Docling), but found that sending the original PDF file to the LLM yielded significantly better results.
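The conversion tooling depends on the source format. For office documents, one common option (shown here as an assumption, not a statement of our exact stack) is LibreOffice in headless mode:

```python
import subprocess
from pathlib import Path

def to_pdf(source: Path, out_dir: Path) -> Path:
    """Convert an office document to PDF using LibreOffice in headless mode."""
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(out_dir), str(source)],
        check=True,
    )
    return out_dir / (source.stem + ".pdf")
```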
The key advantage of sending the PDF itself is that the API provides both extracted text and page images to the model, allowing the AI to use visual information alongside text. This dual approach is particularly valuable because the AI can "see" structural elements such as headings, tables, and formatting cues that are lost in plain-text extraction. For example, a report title that is large and centred on a PDF page is correctly recognised as the title, whereas a raw text dump might mix it up with other content. By standardising on PDF input, we leverage the AI's ability to understand document structure and visual presentation, dramatically improving accuracy in identifying titles, authors, and publication information.
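A minimal sketch of sending a PDF to the model this way, assuming the OpenAI Python SDK's file-upload and Responses APIs (the model name and prompt path are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Upload the trimmed PDF so the model receives both its text and page images
pdf = client.files.create(file=open("asset_trimmed.pdf", "rb"), purpose="user_data")

prompt = open("metadata_prompt.txt", encoding="utf-8").read()

response = client.responses.create(
    model="gpt-4o",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_file", "file_id": pdf.id},
            {"type": "input_text", "text": prompt},
        ],
    }],
)
print(response.output_text)  # the JSON metadata, parsed downstream
```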
Cost management via page limits: Using OpenAI's API for hundreds of pages could get expensive, so we implemented a simple but effective strategy: only send the most relevant portions of each asset to the model. In practice, we often feed the first 5-6 pages and the last 5-6 pages of an asset. Our reasoning is that the front matter (title page, table of contents, executive summary) and back matter (conclusion, references, ISBNs) usually contain the key metadata we need: titles, authors, publication dates, abstracts, etc. Middle pages mostly contain detailed content that is less useful for metadata. The one exception is generating summaries for the description field, but we have found that the LLM generally has enough context from the 10-12 pages it receives, and many of our assets fall under that threshold anyway. By specifying page ranges, we dramatically cut token usage while still capturing the important material. This approach has kept our costs manageable at approximately $0.03-0.04 per PDF processed. A sketch of the trimming step follows.
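This sketch uses the pypdf library; the exact cut-offs are parameters, and ours vary by asset:

```python
from pypdf import PdfReader, PdfWriter

def trim_pdf(src: str, dst: str, head: int = 6, tail: int = 6) -> None:
    """Keep only the first `head` and last `tail` pages of a PDF."""
    reader = PdfReader(src)
    writer = PdfWriter()
    total = len(reader.pages)
    # Short documents are sent whole; longer ones are cut to front and back matter
    pages = (range(total) if total <= head + tail
             else [*range(head), *range(total - tail, total)])
    for i in pages:
        writer.add_page(reader.pages[i])
    with open(dst, "wb") as f:
        writer.write(f)

trim_pdf("report.pdf", "report_trimmed.pdf")
```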
Rapid iteration approach: We prioritised getting a working prototype quickly, then improved it iteratively. Our initial prompt was incomplete but proved the concept by generating basic titles and descriptions. This early success justified further development, leading to a more comprehensive system with proper validation and CSV export pipelines.
With our prompt engineering approach and pragmatic decisions in place, we implemented the workflow and began processing our asset backlog.
Figure 3: Detailed workflow diagram showing the complete AI-powered metadata extraction process including human-in-the-loop quality assurance, format conversion, temporary PDF creation (if needed) and bulk import procedures.
In short, our AI-assisted workflow writes all the core fields in the correct format and applies controlled terms consistently. Our pipeline produces a CSV of metadata for each asset keyed by stable Preservica asset IDs, which archivists can spot-check or batch-edit before final ingest. This level of standardisation and accuracy would be extremely time-consuming to achieve manually, which leads to the benefits we report next.
The AI workflow has transformed metadata creation from a time-intensive manual process to an efficient background task, enabling us to clear backlogs and make collections accessible faster.
Here's an example of our combined AI workflows. For AV assets such as webinars, we first use WhisperX to generate transcripts, then run those transcripts through the metadata workflow to generate structured metadata, including asset-level descriptions.
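Because the metadata workflow expects PDF input, the transcript can be wrapped in a temporary PDF first - the "temporary PDF creation" step in Figure 3. One plausible implementation, assuming the fpdf2 library:

```python
from fpdf import FPDF  # fpdf2

def transcript_to_pdf(transcript_path: str, pdf_path: str) -> None:
    """Wrap a plain-text transcript in a simple PDF for the metadata workflow."""
    text = open(transcript_path, encoding="utf-8").read()
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Helvetica", size=11)
    # Core fonts are Latin-1 only; replace anything outside that range
    text = text.encode("latin-1", "replace").decode("latin-1")
    pdf.multi_cell(0, 5, text)
    pdf.output(pdf_path)

transcript_to_pdf("webinar_transcript.txt", "webinar_transcript.pdf")
```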
Before: Minimal Metadata
Figure 4: Screenshot of the webinar asset before AI-powered metadata extraction, showing limited initial metadata with only basic title and ingest date.
The metadata for a webinar before AI processing contained only the bare minimum information - essentially just a title. In this state, the asset was all but invisible to users searching the archive.
After: AI-Enriched Metadata
Figure 5: Screenshot of the webinar asset after AI-powered metadata extraction, showcasing enriched and structured metadata including comprehensive description, creators, subjects, and Dublin Core fields.
Following AI metadata extraction, the same asset now includes comprehensive metadata. The AI-generated description provides a concise summary of the webinar's content (we append "(AI generated description)" to maintain provenance), multiple creator entries (speakers and hosting organisations), and subject terms drawn from our controlled vocabulary. These keywords - such as Information technology, Artificial intelligence, and Diversity and inclusion - capture the key themes discussed in the webinar.
It's worth noting that the title shown in Figure 5 required human correction. The AI initially generated a title based on transcript content, but we manually edited it to match the actual webinar title. This is common with AV assets, as speakers rarely spell out formal titles during recordings. This type of correction is generally not needed for text-based documents, where titles are typically clearly stated. This demonstrates the importance of human oversight, where archivists can apply contextual knowledge that the AI lacks.
The result is a transformation from an effectively invisible asset to one that is highly discoverable and contextualised for users. This matches what Lamba et al. (2025) report: LLM enrichment can add new access points and improve search in collections with missing metadata.
Implementing this AI-powered workflow taught us many lessons. While we are happy with the results so far, we remain aware that there is room for improvement. We also believe our experience offers some general insights for other digital archives considering similar projects.
Our AI-powered workflow has delivered significant benefits through several key factors: a comprehensive, continually refined prompt; pragmatic design choices such as PDF normalisation and page limits; rapid, iterative prototyping; and a human-in-the-loop review process.
While the workflow has delivered these benefits, we have also encountered limitations and identified areas for future improvement, spanning technical challenges, operational challenges, and quality assurance considerations.
Our project is ongoing, and we are actively working to address these challenges.
Our journey with AI-powered metadata workflows at the ICAEW Digital Archive has demonstrated that AI can be a powerful tool for addressing the metadata challenges facing digital archives today. By combining the efficiency and consistency of AI with the expertise and oversight of professional archivists, we have achieved remarkable results - transforming metadata creation from a time-intensive manual process to an efficient, scalable workflow that maintains professional standards.
The key to our success has been treating AI as an assistive tool rather than a replacement for human judgment. The human-in-the-loop review process provides essential quality control and allows archivists to focus on higher-level tasks like curation and strategic planning.
This work demonstrates how digital archives can leverage AI to address systemic challenges in discovery and accessibility. As digital collections continue to grow exponentially, traditional manual cataloguing approaches become increasingly unsustainable. AI-assisted workflows offer a path forward that maintains professional standards while enabling archives to keep pace with digital preservation demands.
The future of digital archives will likely involve increasingly sophisticated AI-human collaboration. By establishing these workflows now, we position ourselves to take advantage of emerging capabilities while maintaining the professional standards and ethical practices that define our field. The key is to start small, iterate quickly, and always keep the human archivist at the centre of the process.