The Soul of Nonlinear Editing

I keep thinking about Tom Ohanian’s series on the State of Digital Nonlinear Editing. Specifically these paragraphs in Part 10:

Content that is recorded will then be processed by a variety of AI application suites. Each suite will provide different functionality (e.g. tonal analysis, speech-to-text, etc.) based on the characteristics of the content. … Very rich, detailed, and comprehensive metadata about that content will result without the large number of humans currently associated with these tasks.

At that point, the user will be presented with the text associated with the content. Each word, with exact reference to its precise positioning within the data stream, will be indexed. Manipulation of text (e.g. cut, copy, paste), will, in effect, be the method of editing that content. Picture and sound will follow along. [Emphasis mine]
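For concreteness, the workflow Ohanian describes could be sketched as a word-level index: every transcribed word carries its exact position in the stream, so a text selection maps directly to a media span, and cutting the text cuts the picture and sound. A minimal illustration (all names and timings hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str      # the transcribed word
    start: float   # precise position in the data stream, in seconds
    end: float

# Hypothetical speech-to-text output: each word indexed to the stream.
transcript = [
    Word("the", 0.0, 0.2), Word("quick", 0.2, 0.5),
    Word("brown", 0.5, 0.9), Word("fox", 0.9, 1.3),
]

def cut(words, i, j):
    """'Cutting' words i..j of the text yields the media span to remove;
    picture and sound follow along."""
    span = (words[i].start, words[j].end)
    remaining = words[:i] + words[j + 1:]
    return span, remaining

span, remaining = cut(transcript, 1, 2)       # cut "quick brown"
print(span)                                   # (0.2, 0.9)
print(" ".join(w.text for w in remaining))    # the fox
```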

Readers of my blog know that I think machine learning is going to revolutionize the edit suite, mainly by reducing the need for Assistant Editors to perform ‘mechanical’ tasks like Ingesting, Sync-ing, and Grouping. But I don’t agree with Ohanian here. And I think his point of view, that editing is basically mechanical, represents one of the problems we face when trying to discuss the future of nonlinear editing.

Editing is a visceral experience. Full stop.

Editing will never be as easy as cutting and pasting text, because what’s being said is often secondary to how it’s said. Think about the Brett Kavanaugh hearings. You could read transcripts all day long, but it was his anger that left the lasting impression.

The primacy of subtext applies to every genre of editing, from the biggest tentpole blockbuster to the most corporate HR training video. Anyone who’s listened to multiple reads of a Voice Over knows firsthand that the same words, spoken differently, feel different every time. What makes every editor unique is how these subtle differences inform their creative process.

The source/record metaphor is probably a dated way to interact with audio/video media, and smarter tools that assist the editor in finding and selecting media are needed. But I think “Marking IN and Marking OUT to create edit points” is going to be with us for a while, because Marking IN and Marking OUT is editing. The problem isn’t the model; it’s that we need to expand our definition of literacy to include video.

Recorder. A perfect Machine Learning use case.

Atlas Obscura and The New Yorker report on a new documentary about a remarkable woman, Marion Stokes, who recorded television around the clock on 70,000 (!!) VHS tapes from 1979 until 2012.

Marion Stokes was secretly recording television twenty-four hours a day for thirty years. It started in 1979 with the Iranian Hostage Crisis at the dawn of the twenty-four hour news cycle. It ended on December 14, 2012 while the Sandy Hook massacre played on television as Marion passed away. In between, Marion recorded on 70,000 VHS tapes, capturing revolutions, lies, wars, triumphs, catastrophes, bloopers, talk shows, and commercials that tell us who we were, and show how television shaped the world of today. 

From the documentary’s website “RECORDER: The Marion Stokes Project”.

The 70,000 VHS tapes are currently awaiting digitization by the Internet Archive before they can be made available to the public. But these tapes also represent an ideal use case for Machine Learning technology like Google Vision, which could make the entire archive searchable.
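A sketch of what “searchable” could mean here: a vision API (Google Vision is one option) tags each digitized frame with labels, and an inverted index maps each label back to a tape and timecode. The tagging step is omitted; the index itself needs only the standard library (all data below is hypothetical):

```python
from collections import defaultdict

# Hypothetical output of a vision API run over digitized tapes:
# (tape id, timecode in seconds) -> labels detected in that frame.
frame_labels = {
    ("tape-00001", 120.0): ["news anchor", "studio"],
    ("tape-00001", 485.5): ["helicopter", "crowd"],
    ("tape-48210", 12.0):  ["news anchor", "breaking news banner"],
}

# Inverted index: label -> every (tape, timecode) where it appears.
index = defaultdict(list)
for location, labels in frame_labels.items():
    for label in labels:
        index[label].append(location)

print(sorted(index["news anchor"]))
# [('tape-00001', 120.0), ('tape-48210', 12.0)]
```

At archive scale this index would live in a real search backend, but the shape of the lookup is the same.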

This also clearly demonstrates the need for a new editing metaphor, something like what Tom Ohanian wrote about in his excellent State of Digital Nonlinear Editing series on LinkedIn.

Because a massive amount of people can read. And if they interact with content not first and foremost via video and audio, but with words, manipulation of content becomes really easy and very accessible. And it will / should work along these lines: Content that is recorded will then be processed by a variety of AI application suites. Each suite will provide different functionality (e.g. tonal analysis, speech-to-text, etc.) based on the characteristics of the content. When a live or recorded stream of content is digitized, it will be subjected to a variety of these suites.


At that point, the user will be presented with the text associated with the content. Each word, with exact reference to its precise positioning within the data stream, will be indexed. Manipulation of text (e.g. cut, copy, paste), will, in effect, be the method of editing that content. Picture and sound will follow along.

Tom Ohanian’s State of Digital Nonlinear Editing and Digital Media 10

(Note: LinkedIn’s poor formatting makes these articles more difficult to read than necessary, but stick with it; his series is very insightful and thought-provoking.)