Video transcription and summarization with Whisper and ChatGPT

The problem

I've been tasked with ingesting hundreds of video webinars into a digital preservation system (specifically Preservica). The main problem is that these videos have arrived with minimal metadata - often with nothing to go by other than a file name.

Ensuring the discoverability and usability of any content that is ingested into Preservica requires the creation of metadata (Dublin Core). Unfortunately, the only method of creating this metadata would involve manually opening each video to assess its content.

The aim is to at least include a brief description for each webinar and assign relevant subject terms from our custom taxonomy. However, given the volume of videos, a manual approach is completely impractical because of how long it would take.

The solution

- Use WhisperX to transcribe the videos. WhisperX is built on OpenAI's Whisper ASR model and adds word-level timestamps. It also outputs subtitle files, which can be ingested along with the videos, making them essentially full-text searchable.
- Send the transcript to ChatGPT for summarization.
- Send the transcript to Smartlogic Semaphore, my organisation's taxonomy classification system, to suggest relevant topics.

I'm running the notebook via Google Colab because they offer free GPU access so the processing will be a lot faster than running it locally. I say free, but I decided to buy some compute credits to gain access to an even faster GPU 🔥
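
As a rough sketch (not the actual notebook code), the transcription step could be driven from a notebook cell along these lines. The helper names are hypothetical, and the `--model`, `--output_dir` and `--output_format` flags and the `large-v2` model name are assumptions based on the WhisperX command-line interface, so check them against the installed version. The code only builds the command; actually running it is left as a commented-out line.

```python
from pathlib import Path
from typing import List


def build_whisperx_command(video: Path, output_dir: Path, model: str = "large-v2") -> List[str]:
    """Build the argument list for a WhisperX CLI call (flags are assumptions)."""
    return [
        "whisperx",
        str(video),
        "--model", model,                 # assumed model name
        "--output_dir", str(output_dir),  # where the transcript files are written
        "--output_format", "all",         # assumed to request every output format
    ]


def expected_transcript_path(video: Path, output_dir: Path) -> Path:
    """Guess where the plain-text transcript lands (assumes <video stem>.txt)."""
    return output_dir / f"{video.stem}.txt"


command = build_whisperx_command(
    Path("20201119-CJRS3-furlough-scheme-reinvented.mp4"), Path("transcripts")
)
print(" ".join(command))
# To actually run it (requires WhisperX to be installed):
# import subprocess; subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids quoting problems with file names that contain spaces.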

The outcome

I have this webinar - 20201119-CJRS3-furlough-scheme-reinvented.mp4
WhisperX transcribed it, outputting .txt, .json, .srt, .tsv, and .vtt files.
ChatGPT describes the webinar as:
"The video is a webinar on the latest iteration of the furlough scheme, specifically the Coronavirus Job Retention Scheme Version 3. The speaker discusses various aspects of the scheme, including how to participate, deadlines for making claims, eligibility criteria for employees, re-engagement of employees, and changes in calculating usual hours and pay. The video also mentions the importance of accurate RTI submissions for employees on Universal Credit."
Semaphore has assigned these topics:
- Employee retention
- Employment and human resources
- Employment taxation
- Taxation
- PAYE and RTI
- Pay rights
- Coronavirus
- Business operations
- Employment issues
- Government and public sector finance

This all looks pretty good to me - ChatGPT's description could maybe do with a little fine-tuning, but overall it is very impressive. It definitely beats having to sit through the webinar, that's for sure 🙃
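
To show where a description and a set of topics like these could end up, here is a minimal sketch that wraps them in a Dublin Core XML fragment. The `oai_dc` wrapper element and the `build_dublin_core` helper are assumptions for illustration; the exact schema Preservica expects may differ.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"  # assumed wrapper schema


def build_dublin_core(description: str, subjects: Iterable[str]) -> str:
    """Return a Dublin Core fragment with one description and one subject per topic."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    ET.SubElement(root, f"{{{DC_NS}}}description").text = description
    for subject in subjects:
        ET.SubElement(root, f"{{{DC_NS}}}subject").text = subject
    return ET.tostring(root, encoding="unicode")


record = build_dublin_core(
    "The video is a webinar on the latest iteration of the furlough scheme.",
    ["Employee retention", "PAYE and RTI", "Coronavirus"],
)
print(record)
```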

Limitations and improvements

Limits to the token count - a ChatGPT request can use at most 4,097 tokens. I simply slice off the transcript after 17,500 characters (a number I arrived at with a bit of trial and error). There might be a more sensible way to address the token limitation. OpenAI says:

"The limit is currently a technical limitation, but there are often creative ways to solve problems within the limit, e.g. condensing your prompt, breaking the text into smaller pieces, etc." (from: What are tokens and how to count them?)
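
One way to act on the "breaking the text into smaller pieces" suggestion is to summarize each piece separately and then summarize the summaries. The sketch below is an illustration of that idea, not something tested against real transcripts. `split_into_chunks` and `summarize_long_transcript` are hypothetical helpers, `summarize` stands in for whatever function sends text to ChatGPT and returns the reply, and the 17,500-character budget is simply reused from the slicing approach above.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 17500) -> List[str]:
    """Split text on whitespace into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for word in text.split():
        extra = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and current_len + extra > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [word], len(word)
        else:
            current.append(word)
            current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_long_transcript(
    text: str, summarize: Callable[[str], str], max_chars: int = 17500
) -> str:
    """Summarize each chunk, then summarize the joined partial summaries."""
    chunks = split_into_chunks(text, max_chars)
    if not chunks:
        return ""
    if len(chunks) == 1:
        return summarize(chunks[0])
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Assumes the joined partial summaries fit within a single request.
    return summarize(" ".join(partial_summaries))
```

The trade-off is one extra request per chunk, in exchange for the summary covering the whole webinar rather than only its opening section.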

I could also experiment with providing better context. At the moment I'm keeping it very basic, using this as the ChatGPT prompt:

"What follows is a transcript of a video. Using the transcript, summarize the video's content in 3 sentences: {file_content[:17500]}"

I guess I will find more limitations and improvements as I do further testing.
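
For reference, the slice-and-prompt step described above can be sketched as a small function. This version makes one small change to the plain `file_content[:17500]` slice: it backs up to the last whitespace so the transcript is not cut off mid-word. `truncate_at_word` and `build_prompt` are hypothetical names used for illustration.

```python
PROMPT_TEMPLATE = (
    "What follows is a transcript of a video. Using the transcript, "
    "summarize the video's content in 3 sentences: {transcript}"
)
MAX_TRANSCRIPT_CHARS = 17500  # found by trial and error, as described above


def truncate_at_word(text: str, max_chars: int) -> str:
    """Cut text to at most max_chars, backing up to the last whitespace if needed."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    if cut[-1].isspace() or text[max_chars].isspace():
        return cut.rstrip()  # the cut already falls on a word boundary
    parts = cut.rsplit(None, 1)
    # With no whitespace at all, fall back to the hard cut.
    return parts[0] if len(parts) == 2 else cut


def build_prompt(file_content: str, max_chars: int = MAX_TRANSCRIPT_CHARS) -> str:
    """Insert the truncated transcript into the summarization prompt."""
    return PROMPT_TEMPLATE.format(transcript=truncate_at_word(file_content, max_chars))
```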
