How to make videos accessible

Daniel Tonon

So, you need to add a video to your website and you need to pass some level of accessibility requirement. Unfortunately, making videos accessible isn’t particularly easy. You generally can’t just slap some captions on it and call it a day. Don’t get me wrong, captions that make it clear who is talking and cover important sound effects are certainly important for passing accessibility. Captions do nothing to help blind people watch your video though, so you can’t stop there. No, you also need to provide an “audio description”, a “text description”, or both, depending on what level of accessibility you are aiming for. I’ll get into explaining what audio and text descriptions are later. For now, just think of them as alternate versions of your video for people with poor vision.

A bit of a disclaimer before I start. I have been researching the W3C accessibility guidelines for the past few weeks trying to understand them. This is simply my interpretation of what is needed for videos to successfully pass accessibility requirements.

A level: If the video contains important purely visual information, you will need to provide either an audio description or a text description. You can pick which one you would like to provide and there is no need to provide the other.

AA level: An audio description becomes mandatory for any video that contains important purely visual information. Adding a text description is optional.

AAA level: You no longer have a choice. Videos that feature important purely visual information must feature “extended audio descriptions” (I’ll get more into the difference between “audio descriptions” and “extended audio descriptions” later). Text descriptions must also accompany every video, not just the ones that have important purely visual information in them.
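The three levels above can be summed up as a small lookup. This is only a sketch of the interpretation given in this article, not an official W3C checklist. The function name and the wording of the returned strings are made up for illustration, and it only covers the description requirements (captions are a separate matter).

```python
def required_descriptions(level, has_visual_info):
    """Return the description alternatives a pre-recorded video needs.

    `level` is "A", "AA" or "AAA". `has_visual_info` says whether the video
    contains important purely visual information. The rules mirror this
    article's reading of the guidelines and are not an official checklist.
    """
    level = level.upper()
    if level not in ("A", "AA", "AAA"):
        raise ValueError(f"unknown conformance level: {level}")

    needs = []
    if level == "A":
        # Either alternative is enough, and only when something is purely visual.
        if has_visual_info:
            needs.append("audio description or text description")
    elif level == "AA":
        # The audio description becomes mandatory; text stays optional.
        if has_visual_info:
            needs.append("audio description")
    else:
        # AAA: extended audio description when needed, text description always.
        if has_visual_info:
            needs.append("extended audio description")
        needs.append("text description")
    return needs


print(required_descriptions("AAA", has_visual_info=False))
```

Note that at AAA the text description is returned even when nothing in the video is purely visual, which matches the last rule above.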

Important purely visual information

I’ve been talking a lot about “important purely visual information”, so what counts as being “important” and “purely visual”? Well try watching a video, then watching it again with your eyes closed. Did the second time feel like you were missing out on something? This could be text appearing on screen that wasn’t read to you. The location seen in the video might have changed. There could have been a close up on someone’s face as they silently experienced an intense emotion. An infographic might have been displayed on screen and every aspect of it was not explained audibly to you. Maybe people were doing things on screen without explaining what they were doing as they were doing it. There are a vast number of things that could be considered “important purely visual information” and you will need to figure out for yourself if something would be important to a blind person.

An important thing to be aware of is that a video may have a lot of visual activity, but that does not necessarily mean that an audio description is required. Take the following video for example:

That video uses audio as its primary form of communication. It covers accessibility for deaf people by providing captions. There is a lot of visual activity occurring in this video though. This might make you think that you should provide an audio description to explain all the visuals flying around. Watch the video paying close attention to the visuals and the audio, then watch it again with your eyes closed. Did you feel like you were missing out on any vital information when you had your eyes closed compared to when you had them open? I didn’t. My takeaway from this is that if the visuals are merely there to support the audio, and they do not introduce any new information that is not present in the audio track, an audio description is not necessary for that video even at the AAA level.

Audio descriptions

An audio description is a version of the video that uses natural pauses in the audio track to squeeze in narration that describes important visual information to the user. If the narration can still be clearly heard and understood, it can be played over music and sound effects. It must not be played over any dialogue though. At AAA level an “extended audio description” is needed instead. This pauses the video frequently to give the narrator as much time as they need to describe what is visually displayed on screen rather than the narrator rushing to squeeze what they must say between breaks in the dialogue.

The following video from BuzzFeed is a great example of an audio description. There are a couple of times when the narration seems to play over dialogue but I assure you that all the dialogue is there. They are using an interesting editing technique to do those parts that I will get into later.

For comparison, this is the same video without the audio description component:

Tips on making an accessible audio description

1. You can edit the audio description video differently

It’s ok to edit the video slightly differently to allow more time for the narrator to explain things. For example, compare the following sections of the videos provided above: the audio description (0:24-0:50) and the standard video (0:13-0:30). Both video clips have a voice over playing while she is using her phone. This is just before a scene change. The way the two videos handle the scene change is very different though.

In the standard video, it shows her using her phone only very briefly before the scene change. When it cuts to the new scene, it instantly cuts to the point in the video that is in sync with her speaking.

In the audio description, the clip of her using her phone is much longer. The extra time spent on this phone viewing clip gives the video enough time for the robotic narrator voice to explain that she’s looking at her phone and also enough time to fit Ellen’s main voice over dialogue. When the scene change comes, it doesn’t cut straight to the part of the video that is in sync with Ellen’s dialogue like the main video does. Instead it uses video footage of Ellen saying the voice over dialogue from when she was looking at her phone. This buys time for the robotic narrator to explain the new setting. It looks like the narrator is talking over important information since we see Ellen’s lips moving and no sound is coming out. We aren’t missing out on anything though, we’ve already heard all that information from when she was looking at her phone. When the narrator is done talking, the audio quickly cross fades into the main dialogue which is already synced up to the video. It’s a very smooth way of handling scene transitions.

2. You can pre-emptively explain things


Another good way of handling scene transitions or other visually important information is by explaining them before they happen. This cuts down on the need to do additional editing to cater for the audio description narration.

In the main video, there is a long pause in Jordan’s dialogue between 3:55 and 4:01. This is a big opportunity for the audio description to take advantage of the gap and explain things. As Jordan is walking toward the train, her dialogue starts again and she enters the carriage. This leaves the audio description in a bit of a bind. It has a large amount of time before the dialogue starts again but it also needs to explain that she is entering the carriage without interrupting her dialogue while she enters the carriage. It gets around this problem by explaining that she enters the carriage well before she enters. The narrator states “Jordan walks toward the opening doors and gets in” at 5:34 but the doors don’t even open until 5:40 and she doesn’t enter the carriage until 5:42. That is 8 seconds between the narrator saying that she enters and her physically entering.

This isn’t as bad as you might think. The primary audience for audio descriptions is people who can’t see the video, such as blind people. If you can’t see the video, it doesn’t matter that you are being told a bit early that she enters. The important thing is that the purely visual information of her entering the carriage was indeed conveyed to those who can’t see it visually.

3. Explain only key visual information

Vision allows humans to absorb an immense amount of information at practically instantaneous speeds. Unfortunately, it’s not as easy for blind people to absorb the same level of information. While you certainly could spend half an hour explaining every minute detail about a single frame of the video, who is that really helping? It would turn a 1-minute video into a 10-hour long video!

Take this scene for example (5:23 in the audio description).


This is a very busy image with lots of things that you could add to the audio description. There is a white van and a silver car parked on the side of the road to the left. There are 5 orange cylindrical construction cones lining the pathway, 2 to the left and 3 to the right. Each construction cone has 2 reflector strips wrapped around its tip. There is scaffolding made up of silver, yellow and green bars creating a walkway over the foot path. There are trees down the left side of the path roughly 10 meters apart. The sun is out and it looks like maybe 4pm in the afternoon. There is a sign on the wall saying, “Hard Hat Area” in orange text against a black background...

As you can see, it’s possible to just go on and on forever with meaningless detail. That is not what people with visual disabilities want. Making an “extended audio description” (literally pausing the video to add the extra audio description details in) would be the only way you would ever even find enough time to fit all that extra explanation in anyway. People with poor vision just want a few key details that they can use to get a general idea of what is going on. They use their imagination to fill in the rest of the missing details. Even if you did give them all that detail, they might not even retain it all. It’s essentially just wasting their time.

The description that the robot narrator gives for this scene change is “Jordan holding cane walks toward camera through a construction walk way”. That’s it! “Jordan holding cane walks toward camera…” describes the action that is currently happening on screen. “…through a construction walk way” is a general description of the new location that this scene is taking place in. Everything else is left up to the viewer’s imagination.

Text descriptions

Let’s say that you don’t have access to any sort of video editing software and you only need to pass A level accessibility. In that case, adding an “audio description” isn’t an option. You will need to add a “text description” instead.

A “text description”? Well that sounds easy. “This video is about a couple of dogs playing a violin.” There, done. Can I have my pass now?


But I described the video with text!

In terms of passing accessibility, that is not what a “text description” is. A text description needs to provide all the important information that was presented to the user both visually and audibly.

Example text description
To give you an idea of what it takes to make a truly accessible video “text description”, below is a short 50 second video. It is followed by a text description I wrote for it:

The setting is in a park with birds chirping in the background. In the foreground there is a stone couch with a blue phone resting on it. The couch is facing a stone table that is roughly two meters away. There is a teenage girl in a green sweater with short brown hair sitting on the table. She is facing the couch and looking at her phone. There is also a teenage girl dressed in black with long black hair swinging on a swing tied to a tree in the background.

A third teenage girl with shoulder length blond hair wearing a white button up shirt walks onto the scene. “Hi Erin” the blond-haired girl says to the girl with a green sweater. “Hi Armin” the girl in green replies.

Armin takes a seat on the stone couch facing Erin and lets out an exhausted sigh. She notices the phone next to her and picks it up. “Mikasa’s phone” she says with a sly grin. She starts looking through the phone and notices something odd. “What’s this?” she says leaning forward, scrolling through a series of Mikasa’s private messages. “Erin was doing what?” Armin exclaims in shock. She continues to scroll through the messages on the phone continuously shocked at what she is reading, “Levai was adjusting into...what?”.

While Armin was looking through the messages, the girl in black got off the swing and started walking over. “Armin?” she says, noticing the girl looking through her phone.

Armin screams and throws the phone into the air toward Erin. Erin fortunately reacts fast enough to catch it and is quite proud of herself for doing so. “Hi Mikasa” Armin says trying her best to keep calm, acting like she was doing nothing wrong. Erin sneaks a peek at the phone while Mikasa is focused on Armin.

Mikasa turns to Erin and snatches the phone off her. She then heads back to the swing to continue swinging.

“Did you read that?” Armin says, eyes wide with disbelief at what she just saw.

“No, I can’t read”, Erin responds without much interest.

“That was about you and Le…” Armin pauses unable to finish her sentence. She raises her hand to her mouth. “I’m going to puke!” she says, getting off the couch and running for the trees in the background.

Erin leaps to her feet, fists clenched. “Oh, come on! It’s not that bad!” she exclaims, as she watches Armin run for the trees.

Tips on writing an accessible video text description

1. Format it like a chapter in a novel or a script from a play


The example above is formatted a lot like a chapter in a novel. If you prefer, you can also write it like a script from a play, but the same level of detail needs to be portrayed as if it were in novel format. The script format may seem quicker and easier but conveying character emotions is often much easier through the novel format. If emotion is of little importance then the script format might be more suitable. The audio description example from earlier also came with an excellent example of a text description in script format.

2. A lot of detail (but not too much)


There is a lot of detail in the example but only enough detail to carry the story along. As I stated earlier, visuals alone can convey an immense amount of information at practically instantaneous speeds. Including large amounts of unnecessary detail will cause the reader to lose interest. They will also have a much harder time keeping track of the story. For those reasons, keep just enough detail to give a clear picture of what is happening. This is like how detail is handled in audio descriptions. Start off explaining the setting and a little bit about what each character looks like so that the reader has some base information to go off. After that, ease up on the details and let the user’s imagination fill in the gaps. If there is a scene change or a location change, then the scene will need to be set up again by describing the setting and characters. If characters have a costume change, a few details about what the new costume looks like should be mentioned.

In something without much emotion or story in it (like a face to face interview) you will still need to do that initial set up phase explaining the setting and what any new people or clothing in the video look like.

3. Foreshadowing


Sometimes small details are more important than they might seem. For example, Erin sneaking a peek at the phone was a very minor detail in the video. However, Armin later asked her “Did you read that?”. Armin asking Erin that wouldn’t make sense if Erin never looked at the phone. She couldn’t have read it if she never looked at it. That minor detail of Erin looking at the phone needed to be mentioned in the text description for that line to make sense.

4. Identifying characters


You will want to name characters as soon as possible but not until one character says the name of another character. People watching the video will not know the names of characters until those names are spoken. People reading the text description shouldn’t know the names of characters until they are spoken either. Instead of using a name, use a single unique visual detail about each character to identify them (the blond-haired girl, the girl in black, etc.). As soon as a character is given a name, switch to using their name instead.

Make sure not to mix identifying features. In the example, Erin has short brown hair and a green sweater. If the example sometimes referred to her as “the girl with brown hair” and sometimes as “the girl with a green sweater” it would make her sound like two different people. Pick one unique feature per character and stick to it.

If using the script format, story probably doesn’t matter as much to you. You can just use the real names from the start instead of using unique visual identifiers.

5. Try to convey character emotions as much as possible


Giving characters emotions makes the story more engaging for the reader. The emotion behind what people are saying can also drastically change the meaning behind the words that they say. Good ways to convey emotion (other than explicitly stating what emotion a character is feeling) are using speaking words other than “say” and “said”, or adding an adverb to explain how they said it. You can also explain the actions the character performed while saying those words. When explaining the character’s actions, try to use words that reflect the energy that the character has while performing that action.

For example:
“I love my new sweater” John said, giving his grandmother a hug.
“I love my new sweater” John said gleefully, jumping into his grandmother’s arms for a hug.
“I love my new sweater” John groaned, rolling his eyes as he begrudgingly gave his grandmother a hug.

In script format, you never write the words “say” or “said”. Not having access to speaking words significantly increases the difficulty of conveying emotion in that format. That is why novel format is more suitable for videos that have a large focus on story and emotion in them.

6. You couldn’t see it? Doesn’t necessarily mean it didn’t happen

In order to convey emotion better, you may need to mention actions that you can’t necessarily see. For example, at 0:11 in the video Armin says “Mikasa’s phone”. In the text description I wrote that as: “‘Mikasa’s phone’ she says with a sly grin”.

In the video, Armin’s hair is completely covering her face at the point in time when she says that. It makes it impossible to tell what facial expression she has.

Armin has a rather naughty mentality at that point. That naughty mentality is carried across by the tone of her voice in the video. Text can’t portray tone of voice very well though. By saying that she has “a sly grin”, it helps to convey that naughty mentality of hers. You can’t exactly prove that she doesn’t have a sly grin, now can you?

7. Use <strong> instead of ALL-CAPS


Often writers will use ALL-CAPS in novels to signify that a character is YELLING LOUDLY LIKE THIS! You should use <strong> tags instead. Screen readers will often read out ALL-CAPS text as separate letters. For example, instead of reading “CAT” as “cat” it will be read out as “C-A-T”. This is because they generally assume that ALL-CAPS text is an abbreviation for something. Some screen readers are smart enough to figure out that the text should be read out as text rather than as separate letters in an abbreviation. It’s much better to not risk having an entire sentence spelled out to the user one letter at a time though.

(PS. If you are reading this through a screen reader, sorry about saying “yelling loudly like this” in all caps. Just making a point.)
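If you already have a draft full of shouted text, a quick script can do a first pass for you. Below is a minimal sketch in Python. The function name and the "two or more capitalised words in a row" rule are assumptions made for illustration, chosen so that lone abbreviations (which really should be spelled out) are left alone. It lowercases everything inside the run, so names and the word "I" will need fixing by hand afterwards.

```python
import re

# Heuristic (an assumption, not a standard): two or more consecutive
# ALL-CAPS words count as shouting. A single ALL-CAPS word is left
# untouched because it may be a genuine abbreviation.
SHOUT_RUN = re.compile(r"\b[A-Z][A-Z'’]*(?:\s+[A-Z][A-Z'’]*)+\b")


def shout_to_strong(text):
    """Wrap runs of ALL-CAPS words in <strong> and lowercase them."""
    def replace(match):
        return "<strong>" + match.group(0).lower() + "</strong>"
    return SHOUT_RUN.sub(replace, text)


print(shout_to_strong("He is YELLING LOUDLY LIKE THIS!"))
# prints: He is <strong>yelling loudly like this</strong>!
```

Treat the output as a starting point and proofread it. A heuristic like this can’t tell a shouted two-word phrase from two abbreviations that happen to sit next to each other.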

8. Using <em> for emphasis

Sometimes applying emphasis to different words in the same sentence will cause the meaning of the sentence to change. I used this technique when Armin said “Mikasa’s phone” in the example text description. Think about what different meanings can come from emphasising one word over the other.

“Mikasa’s phone” (emphasising “Mikasa’s”) suggests that the object being a phone is mostly irrelevant. It could be any item and still be just as interesting. The important thing is that it is Mikasa’s. A creepy guy who had a crush on Mikasa might say it this way. He would be excited about finding any item of Mikasa’s since Mikasa is the main thing he cares about.

“Mikasa’s phone” (emphasising “phone”) suggests that Armin is far more interested in the fact that it is Mikasa’s phone specifically rather than just some random other object Mikasa owns. You can hear the extra emphasis Armin applies to the word “phone” when she says it in the video. She says “Mikasa’s” quite flatly but her pitch fluctuates when she says “phone”. This phone prop drives the whole story. It makes sense that Armin’s interest is to do with the fact this is specifically Mikasa’s phone. If it was just some other random item that Mikasa owns, Armin wouldn’t have as much interest in it.

Since emphasising different words can carry so much meaning behind a sentence, this emphasis needs to be portrayed in a way that everyone (including screen readers) can understand. The way to do that is through <em> tags. These tags will tell all users (including screen readers) exactly what words should have emphasis so no hidden meaning is lost in translation. If you ever hear obvious emphasis getting applied to a word in a video, that word should be wrapped in <em> tags. Not doing so may cause important subtext to be lost in translation.
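To make the two readings of “Mikasa’s phone” concrete, here is a small sketch that produces the markup for each. The helper name is hypothetical. It only wraps the first occurrence of the chosen word and escapes the rest of the sentence so it is safe to drop into a page.

```python
from html import escape


def emphasise(sentence, word):
    """Wrap the first occurrence of `word` in <em>, escaping everything else."""
    before, found, after = sentence.partition(word)
    if not found:
        # The word is not in the sentence, so there is nothing to emphasise.
        return escape(sentence)
    return f"{escape(before)}<em>{escape(found)}</em>{escape(after)}"


print(emphasise("Mikasa’s phone", "Mikasa’s"))  # <em>Mikasa’s</em> phone
print(emphasise("Mikasa’s phone", "phone"))     # Mikasa’s <em>phone</em>
```

The second line of output is the version that matches how Armin says it in the video.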

9. Non-standard punctuation


You may sometimes see writers use the question/exclamation/question mark technique (?!?) as a short cut to get across that a character is both shocked and confused (“What?!?”). This is far more likely to confuse a screen reader than help it. Just stick to either the question mark or the exclamation mark. Don’t use both at once. The best way to represent “What?!?” is with bold text and a question mark (“What?”).
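A rough way to clean this up in bulk is sketched below. The function name is made up, and bolding only the word directly before the punctuation is an assumption. You may want to bold more of the sentence by hand. Runs made of a single kind of mark (such as “!!”) are left alone because the advice above is only about mixing the two.

```python
import re

# A word followed by two or more question/exclamation marks.
MARK_RUN = re.compile(r"(\w+)([?!]{2,})")


def fix_mixed_punctuation(text):
    """Turn 'What?!?' into '<strong>What</strong>?'."""
    def replace(match):
        word, marks = match.group(1), match.group(2)
        if "?" in marks and "!" in marks:
            # Shock is carried by the bold text, confusion by the question mark.
            return f"<strong>{word}</strong>?"
        return match.group(0)  # not a mixed run, keep as written
    return MARK_RUN.sub(replace, text)


print(fix_mixed_punctuation("“What?!?”"))
```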

10. Censored swearing


If you need to censor swearing, use f***ing stars. Don’t use the f@!&ing comic book technique of using random symbols. It’s extremely confusing for screen reader users.
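If a script or transcript already uses the comic book style, a small substitution can convert it to stars. This is only a sketch. The function name and the exact set of symbols are assumptions, and it only touches symbol runs that sit in the middle of a word so that ordinary punctuation at the end of a sentence is left alone.

```python
import re

# Two or more "comic book" symbols with a letter on both sides.
GRAWLIX = re.compile(r"(?<=[A-Za-z])[@#$%&!*]{2,}(?=[A-Za-z])")


def censor_with_stars(text):
    """Replace each symbol in a mid-word censoring run with a star."""
    return GRAWLIX.sub(lambda match: "*" * len(match.group(0)), text)


print(censor_with_stars("Don’t use the f@!&ing comic book technique"))
# prints: Don’t use the f***ing comic book technique
```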

11. Music


Let’s say that this is a music video for a new song that an artist has released. To write a text-description that will pass accessibility requirements, you can’t just provide the lyrics for the song and call it a day. The text description should also make note of various details about the style of the music like genre, tempo, mood etc. Any major shifts in the musical style should also be mentioned at the point in the lyrics where the shift occurs. Don’t forget that the visuals in the video also need to be explained. There are many non-disabled users that only want to see the lyrics though. To make their experience better, it would be best to provide the highly detailed text description as an alternative view to the lyrics only text. Between the detailed text description and the lyrics only versions, the lyrics only version should be the one that is displayed by default.

If a video has background music, the emotion that the music is trying to convey is typically what is most important. Just saying “sad music plays in the background” would probably be enough since it isn’t the main focus of the video like it is in a music video. Sometimes the nationality of the music is important as well. Let’s say a crew of people shaking maracas and wearing sombreros come out and start dancing. If Mexican music starts playing then the fact that it is Mexican is definitely an important detail that needs to be mentioned in the text description.

12. Make it visible to all


If you are going to put the effort into writing a proper text description, why would you visually hide it? There are likely to be regular users who will enjoy reading the written version more than watching the video. These alternate options need to either be provided directly on the page near the main video or have links to them near the main video. “Near the main video” means both “near” in the source order as well as page layout.

Video/audio only pre-recorded media

Video only:

If a video has no audio, the video must provide one of two things. It can provide a detailed explanation of all the important visual information in the video (essentially a “text description”). It could alternatively provide a version of the video that has an audio track explaining all the important visual information being shown in the video (essentially an “audio description”).

Audio only:

Audio that has no video content must have a text description that provides a detailed explanation of what is contained in the audio. Something like a podcast would need a transcript written up for it. A music track would need to follow the same sort of rules that the video text descriptions have around music. That means not only providing the lyrics for the song but also providing various details about the music like genre, tempo, mood etc. Don’t forget to mention any major shifts in the music style that occur.

Live media

If you only need to reach “A” level accessibility, then guess what? You’re off the hook! You don’t need to do anything special for live video feeds at A level.

If you are reaching for “AA” level though, you will need to invest in a live captioning service such as the one provided by The Captioning Studio.

If you are reaching for “AAA” level, you will also need to fork out for that live captioning service on any live audio only broadcasts.

Flashing content


If your video has any rapidly flashing content in it (content that flashes more than three times per second), then the easiest way to pass this criterion is to simply not include the video on your site. An explicit warning at the start of the video stating that it contains flashes of light won’t help you pass accessibility either. A warning would just tell people with epilepsy that they can’t watch the video (thus making it inaccessible to them).
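As a rough illustration of the “more than three times per second” rule, the sketch below checks a list of flash timestamps for any one-second stretch that contains more than three of them. How you detect the flashes in the first place is outside the scope of this sketch. The function name and the idea of having a ready-made list of timestamps (in seconds) are assumptions.

```python
def exceeds_flash_rate(flash_times, max_flashes=3, window=1.0):
    """Return True if more than `max_flashes` fall inside any `window` seconds."""
    times = sorted(flash_times)
    start = 0
    for end, current in enumerate(times):
        # Slide the start of the window forward until it spans under `window` seconds.
        while current - times[start] >= window:
            start += 1
        if end - start + 1 > max_flashes:
            return True
    return False


print(exceeds_flash_rate([0.0, 0.2, 0.4, 0.6]))  # True: four flashes in 0.6 seconds
print(exceeds_flash_rate([0.0, 0.5, 1.0, 1.5]))  # False: never more than two in a second
```

A passing result here only says the flash rate looks slow enough. It is not a substitute for a proper photosensitivity analysis tool.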

The W3C recommends editing the video to slow the flash rate down to less than 3 times per second. If that isn’t an option, and you are aiming for AAA then you have no choice. You must not include the video on your site. If you are only aiming for A or AA though, you have one more option available to you… it’s a bit complicated and risky though.

There is an acceptable threshold for flashing content on A and AA sites. This threshold basically states that the flashing content can’t take up any more than 10 degrees of vision. This is how the W3C explains the flash threshold for A and AA sites:

“The 1024 x 768 screen is used as the reference screen resolution for the evaluation. The 341 x 256-pixel block represents a 10 degree viewport at a typical viewing distance.”

Based on that explanation, this is how I interpret the threshold. 341/1024 roughly equals 1/3 and 256/768 equals exactly 1/3. So based on that, the rule seems to be that flashing content can't take up any more than 1/3 of the available screen size at any one time both in terms of height and width. The maximum width of a video with flashing content on an iPhone 5 would only be roughly 100px wide. The maximum size on a desktop sized screen would be 341 x 256. I wouldn’t go any larger than that even on a screen larger than 1024 x 768.
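The arithmetic above can be sketched as a small helper. It follows this article’s one-third reading of the reference block and caps the result at 341 x 256 no matter how large the screen is. The function name is made up, and the iPhone 5 size of 320 x 568 CSS pixels is an assumption used for the example. Treat the numbers as a rough upper bound under that interpretation rather than a conformance check.

```python
# Reference values quoted by the W3C: a 341 x 256 block on a 1024 x 768 screen.
REF_SCREEN = (1024, 768)
REF_BLOCK = (341, 256)


def max_flashing_size(screen_width, screen_height):
    """Largest flashing area (width, height) in pixels under the one-third reading."""
    # Scale the reference block to the screen using integer maths, rounding down.
    width = screen_width * REF_BLOCK[0] // REF_SCREEN[0]
    height = screen_height * REF_BLOCK[1] // REF_SCREEN[1]
    # Never go larger than the reference block, even on bigger screens.
    return min(width, REF_BLOCK[0]), min(height, REF_BLOCK[1])


print(max_flashing_size(320, 568))    # (106, 189) on an iPhone 5 sized screen
print(max_flashing_size(1024, 768))   # (341, 256)
print(max_flashing_size(1920, 1080))  # (341, 256), capped
```

The first result is where the “roughly 100px wide” figure comes from.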

If a video with flashing machine gun fire is small enough to fit in those dimensions, technically it is ok to place it on A and AA sites. It is still risky to put any sort of rapidly flashing content on a website though even if it is within the size limit threshold. After all, the user might be looking at the screen a bit more closely than expected. It’s much safer to either edit the video to slow down/remove the flashes or simply avoid placing the video on the site in the first place.

Other multi-media accessibility requirements

Level A:

The only other thing you need to do concerns auto-playing media. If a video (or audio on its own) plays automatically and its sound lasts longer than 3 seconds, you need to give users a way to pause, stop or mute it.

Level AAA:

When you get to AAA level accessibility you will need to consider hiring a sign language interpreter. Either the original video or another version of it will need to have a sign language interpreter embedded into it to pass.

AAA level also requires there to be little to no background audio over any spoken dialogue. Acceptable background noise is 20 decibels softer than the speech that is playing over it. That usually equates to roughly four times quieter. There are two exceptions though. Audio that is part of a CAPTCHA mechanism (one of those things that checks you are human) is excused from the requirement since it defeats the purpose of the CAPTCHA. Any musical audio of singing or rapping is also excused from having to conform to this requirement. High quality audio recorded in a studio is likely to pass this easily. Audio recorded on some kid’s phone as they were walking near a construction site on a windy day is less likely to pass. The microphone would be picking up on all the background noise from the construction site and it would also be getting buffeted by the wind. Another possible fail might be too much reverb if the audio was recorded in something like a bathroom.
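For anyone wondering where “four times” comes from, here is a minimal sketch of the numbers. It assumes you can measure the RMS amplitude of the speech and of the background separately (for example from separate tracks in your editing software), and it uses the common rule of thumb that every 10 decibel drop sounds about half as loud. The function names are hypothetical.

```python
import math


def level_difference_db(speech_rms, background_rms):
    """How many decibels the background sits below the speech."""
    return 20 * math.log10(speech_rms / background_rms)


def perceived_times_quieter(db_difference):
    """Rule of thumb: each 10 dB drop sounds roughly half as loud."""
    return 2 ** (db_difference / 10)


def background_is_quiet_enough(speech_rms, background_rms, required_db=20.0):
    """True when the background is at least `required_db` below the speech."""
    return level_difference_db(speech_rms, background_rms) >= required_db


print(perceived_times_quieter(20))            # 4.0
print(background_is_quiet_enough(1.0, 0.05))  # True, about 26 dB quieter
print(background_is_quiet_enough(1.0, 0.5))   # False, only about 6 dB quieter
```

In amplitude terms, 20 decibels means the background signal is ten times smaller than the speech, even though it only sounds about four times quieter.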

In conclusion

If you need to pass a certain accessibility requirement, you may want to think twice before trying to add video and audio content to your site. You can certainly add it if you like, but be aware that there is often a lot of extra work involved in making that media accessible to all users. This is especially true if it is a video with a heavy focus on story and emotion. You should aim to convey all important information in the audio track of the main video if possible. Videos that convey all important information audibly don’t need to do anything special other than adding captions to pass AA accessibility.
