We recently added a new feature to our /speech/transcribe Robot that enables it to listen to human-spoken language and output SRT files. You can use these files to caption media in video players automatically. And thanks to Transloadit's LEGO-like composability, you can even burn the subtitles straight into the videos themselves. Today, I'd like to show you just how to do that.


AI is being used in more corners of life than you might expect, and for good reason. It can dramatically enhance the speed, precision, and effectiveness of human efforts. Still, the sheer amount of data, data sanitization, and video cards required to train complex models can make AI seem intimidating, which may deter smaller companies from leveraging it. Transloadit is on a mission to put AI in the hands of more developers by making it much easier to wield and create value with.

Over the years, we have released a series of AI-based Robots that are simple and plug-and-play. You won't need to spend a second learning about or training AI models.

The AI Robot we'll be taking a look at today is our /speech/transcribe Robot. This Robot can listen to any audio or video file and automatically create a text file from it, removing the need to do any tedious transcribing yourself. It comes with additional benefits out of the box, such as support for many languages and the option to use either Amazon's or Google's AI by switching a single parameter. Because switching providers means changing just three characters, you can quickly discover which of them serves your use case best.

Transloadit handles the implementation details under the hood, so you don't need to worry about error handling or sending files with the proper codecs, bandwidths, etc. It offers one interface, driven by a single declarative JSON recipe, that plugs into workflows where other Robots join in and greatly enhance the value these AI Robots already provide. The whole becomes greater than the sum of its parts.

Getting started

In this post, we will be combining our /speech/transcribe and /video/subtitle Robots to transcribe human-spoken language inside a video and then burn the output right back into the video itself. Transcribing by hand can sometimes yield better results, but it is a costly and tedious process, often to the point of being prohibitive. For many purposes, our customers find automatic transcription "good enough", and it brings costs down so drastically that transcribing videos becomes a compelling prospect in the first place.

One use case is sharing videos on social media, where videos with subtitles tend to be much better at grabbing someone's attention than those without.

Often, the transcripts are also indexed by something like Elasticsearch, which makes the videos searchable. A search can then return hits on any word spoken in the video instead of just hits on the video's title.
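To illustrate the idea of searching by spoken word, here is a minimal Python sketch that parses SRT text into cues and builds a small in-memory inverted index. It is a stand-in for a real search engine such as Elasticsearch, not a recipe for one. The function names, the video ID "clip-1", and the sample captions are all made up for the example.

```python
import re
from collections import defaultdict

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000"
# and captures the start time down to the second.
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}),\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")


def parse_srt(srt_text):
    """Return a list of (start_time, caption_text) tuples from SRT text."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIMING.match(line.strip())
            if match:
                # Everything after the timing line is caption text.
                text = " ".join(part.strip() for part in lines[i + 1:])
                cues.append((match.group(1), text))
                break
    return cues


def build_index(transcripts):
    """Map each lowercase word to the (video_id, start_time) pairs where it is spoken."""
    index = defaultdict(list)
    for video_id, srt_text in transcripts.items():
        for start, text in parse_srt(srt_text):
            for word in set(re.findall(r"[a-z']+", text.lower())):
                index[word].append((video_id, start))
    return index


# Hypothetical sample transcript, invented for this example.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:04,000
The river floods every spring.

2
00:00:05,500 --> 00:00:08,000
Herds gather along its banks.
"""

index = build_index({"clip-1": SAMPLE_SRT})
print(index["river"])
```

Looking up a word returns the video and the timestamp at which it was spoken, which is the extra information a title-only search cannot provide.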

Below is the recipe we used to combine our two features. We call these Assembly Instructions, and you can save them in a Template in your account for safe reuse at any time. We'll run through each Step and then give it a whirl.

{
  "steps": {
    ":original": {
      "robot": "/upload/handle"
    },
    "transcribed": {
      "use": ":original",
      "robot": "/speech/transcribe",
      "provider": "gcp",
      "format": "srt",
      "result": true
    },
    "subtitled": {
      "use": {
        "bundle_steps": true,
        "steps": [
          {
            "name": ":original",
            "as": "video"
          },
          {
            "name": "transcribed",
            "as": "subtitles"
          }
        ]
      },
      "robot": "/video/subtitle",
      "ffmpeg_stack": "v6.0.0",
      "preset": "ipad-high",
      "result": true
    },
    "exported": {
      "use": ["subtitled", "transcribed"],
      "robot": "/s3/store",
      "credentials": "s3_cred"
    }
  }
}

Our first Step, using the /upload/handle Robot, is for, well, handling our uploads. Who would've thought? This Step allows you to upload your file. Do note, however, that we have many alternative Robots available for importing files.

Moving on to our second Step, "transcribed": this is where the aforementioned AI magic comes into play. We set the provider parameter to "gcp", but you could also use "aws". Additionally, we are required to set a format parameter. Since we are trying to embed subtitles into a video, we have opted for the "srt" format. Alternatively, you could use "webvtt", but we don't need its benefits over "srt" for our use case.
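To make the provider switch concrete, here is a minimal Python sketch that builds the "transcribed" Step as a dictionary. The helper `transcribe_step` and the two sets of known values are hypothetical names for illustration. The sets only contain the values mentioned in this post, so the Robot may well accept others.

```python
import json

# Values named in this post; treat these sets as an assumption,
# since the Robot may accept more than what is listed here.
KNOWN_PROVIDERS = {"aws", "gcp"}
KNOWN_FORMATS = {"srt", "webvtt"}


def transcribe_step(provider="gcp", fmt="srt", use=":original"):
    """Build the /speech/transcribe Step from the recipe for a given provider and format."""
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return {
        "use": use,
        "robot": "/speech/transcribe",
        "provider": provider,
        "format": fmt,
        "result": True,
    }


# Generate one variant per provider so their output can be compared side by side.
for provider in sorted(KNOWN_PROVIDERS):
    print(json.dumps({"transcribed": transcribe_step(provider)}, indent=2))
```

Everything except the provider value stays identical between the two variants, which is what makes comparing them cheap.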

You could export the SRT files separately so that you can overlay the captions in the video player. In this case, though, we'll burn the generated subtitles into the originally uploaded video file using our /video/subtitle Robot. To merge the two Steps, reference both the upload and the transcription Steps within the /video/subtitle Robot's use parameter using the use.steps[].as naming syntax.
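The bundling in the recipe above can be sketched in Python as follows. The helpers `bundle_use`, `referenced_steps`, and `missing_references` are hypothetical names. The reference check is purely a local convenience for catching typos before submitting, not something Transloadit requires. It handles the three shapes of use that appear in the recipe: a string, a list, and an object with a steps array.

```python
def bundle_use(**step_by_role):
    """Build a `use` object that hands several Steps to one Robot, each under a role."""
    return {
        "bundle_steps": True,
        "steps": [{"name": name, "as": role} for role, name in step_by_role.items()],
    }


def referenced_steps(use):
    """List every Step name a `use` value points at (string, list, or object form)."""
    if isinstance(use, str):
        return [use]
    if isinstance(use, list):
        return [name for item in use for name in referenced_steps(item)]
    if isinstance(use, dict):
        if "name" in use:
            return [use["name"]]
        return referenced_steps(use.get("steps", []))
    return []


def missing_references(steps):
    """Return (step, missing_name) pairs for `use` entries that name no defined Step."""
    return [
        (step_name, ref)
        for step_name, step in steps.items()
        for ref in referenced_steps(step.get("use", []))
        if ref not in steps
    ]


# The same Steps as in the recipe, trimmed to the fields that matter here.
steps = {
    ":original": {"robot": "/upload/handle"},
    "transcribed": {"use": ":original", "robot": "/speech/transcribe"},
    "subtitled": {
        "use": bundle_use(video=":original", subtitles="transcribed"),
        "robot": "/video/subtitle",
    },
    "exported": {"use": ["subtitled", "transcribed"], "robot": "/s3/store"},
}

print(missing_references(steps))
```

The role given in "as" is what tells the /video/subtitle Robot which input is the video and which is the subtitle file, so a misspelled Step name there is worth catching early.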

Finally, we use the /s3/store Robot to export our results to Amazon S3. Like our first Step, though, this Step is easy to swap out, as we offer many other exporting options.


Testing

To test our Template, we thought we'd use a short clip from David Attenborough's phenomenal Planet Earth series.

As you can see, the Assembly was a success. Thanks to AI, we don't have to go through the laborious process of creating a subtitle file by hand.

This workflow can now be integrated into your web/mobile app, back-end service, or command-line utility via 18 convenient SDKs like Uppy or the Go SDK, as well as directly via our JSON REST API.
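For the REST route, a rough Python sketch of preparing a request that references a saved Template might look like this. It only builds the `params` JSON and a signature, and sends nothing. The field names, the expiry format, and the "sha384:" signature scheme reflect my understanding of the API and should be treated as assumptions to verify against the API documentation. The key, secret, and Template ID are placeholders.

```python
import hashlib
import hmac
import json
from datetime import datetime, timedelta, timezone


def build_params(auth_key, template_id, valid_minutes=30, now=None):
    """Serialize the `params` field: auth key, expiry timestamp, and Template ID."""
    now = now or datetime.now(timezone.utc)
    # Assumed expiry format: "YYYY/MM/DD HH:MM:SS+00:00" in UTC.
    expires = (now + timedelta(minutes=valid_minutes)).strftime("%Y/%m/%d %H:%M:%S+00:00")
    return json.dumps({
        "auth": {"key": auth_key, "expires": expires},
        "template_id": template_id,
    })


def sign(params_json, auth_secret):
    """Compute an HMAC-SHA384 signature over the exact params string (assumed scheme)."""
    digest = hmac.new(
        auth_secret.encode("utf-8"),
        params_json.encode("utf-8"),
        hashlib.sha384,
    ).hexdigest()
    return "sha384:" + digest


# Placeholder credentials and Template ID; replace with your own.
params = build_params("YOUR_AUTH_KEY", "YOUR_TEMPLATE_ID")
signature = sign(params, "YOUR_AUTH_SECRET")
print(params)
print(signature)
```

The signature must be computed over the same string that is sent as `params`, which is why the sketch serializes once and signs the result. In practice, the SDKs take care of these details for you.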

We look forward to hearing what you'll build with us. Do let us know on Twitter!