Video transcription and summarization with Whisper and ChatGPT

The problem #

I'm tasked with ingesting hundreds of video webinars into a digital preservation system (specifically Preservica). The main problem is that these videos have arrived with minimal metadata - often with nothing to go by other than a file name.

Ensuring the discoverability and usability of content ingested into Preservica requires creating metadata (Dublin Core). Unfortunately, the only way to create this metadata would be to manually open each video and assess its content.

The aim is to at least include a brief description for each webinar and assign relevant subject terms from our custom taxonomy. However, given the volume of videos, a manual approach is completely impractical - it would simply take too long.

The solution #

I'm running the notebook via Google Colab because it offers free GPU access, so the processing will be a lot faster than running it locally. I say free, but I decided to buy some compute credits to gain access to an even faster GPU 🔥
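In outline, the notebook transcribes each webinar with Whisper and then asks ChatGPT to summarise the transcript. Here is a minimal sketch of that pipeline - the model size, file path and prompt wording are illustrative rather than the exact notebook code, and it assumes the pre-1.0 `openai` Python client:

```python
# Minimal sketch: transcribe a webinar with Whisper, then summarise it with ChatGPT.
# Model size, file path and prompt are illustrative; assumes the pre-1.0 openai client.
import openai
import whisper

openai.api_key = "YOUR_API_KEY"  # stored securely (e.g. Colab secrets) in practice

# Whisper transcription - larger models are more accurate but slower, even on a GPU
model = whisper.load_model("medium")
transcript = model.transcribe("webinar.mp4")["text"]

# Crude guard against the token limit (see Limitations below)
transcript = transcript[:17500]

# Ask ChatGPT for a short description of the webinar
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Summarise this webinar transcript in a short paragraph:\n\n" + transcript,
    }],
)
print(response["choices"][0]["message"]["content"])
```

Running this on Colab's GPU mainly speeds up the Whisper step; the ChatGPT call is a remote API request either way.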

The outcome #

This all looks pretty good to me - ChatGPT's description could maybe do with a little fine-tuning, but overall it is very impressive. It definitely beats having to sit through the webinar, that's for sure 🙃

Limitations and improvements #

Limits to the token count - ChatGPT requests can use at most 4,097 tokens, and OpenAI's documentation notes that this limit is shared between the prompt and the completion. I simply slice off the transcript after 17,500 characters (I got to this number with a bit of trial and error). There might be a more sensible way to address the token limitation - for example, counting tokens rather than characters, as sketched below.
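Something along these lines (assuming the `tiktoken` library) would trim by tokens instead of characters:

```python
# Sketch: truncate the transcript by token count rather than character count.
# The 3,000-token budget is an arbitrary example that leaves room for the
# prompt and the completion within the 4,097-token limit.
import tiktoken

def truncate_to_tokens(text: str, max_tokens: int = 3000,
                       model: str = "gpt-3.5-turbo") -> str:
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return encoding.decode(tokens[:max_tokens])
```

This would simply replace the `transcript[:17500]` slice in the sketch above.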

Maybe I could also experiment with providing better context. At the moment I'm keeping the ChatGPT prompt very basic, just asking for a summary of the transcript.


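One option would be a richer prompt that tells ChatGPT what the description is for and offers taxonomy terms to choose from - something along these lines, where the wording and terms are only placeholders:

```python
# Hypothetical richer prompt: ask for a catalogue-style description and for
# subject terms drawn from the custom taxonomy. Wording and terms are placeholders.
transcript = "...the Whisper transcript..."  # placeholder
subject_terms = ["digital preservation", "archives", "metadata"]  # placeholder terms

messages = [
    {"role": "system",
     "content": "You write concise catalogue descriptions of webinar recordings "
                "for a digital preservation system."},
    {"role": "user",
     "content": ("Summarise this webinar in 2-3 sentences and suggest up to three "
                 "subject terms from this list: " + ", ".join(subject_terms)
                 + "\n\nTranscript:\n" + transcript)},
]
# messages would then be passed to openai.ChatCompletion.create() as in the sketch above
```
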
I guess I will find more limitations and improvements as I do further testing.