While machine learning systems have become much better at identifying objects within still frames, the next stage is picking out individual objects within video, which could open up new considerations in brand placement, visual effects, accessibility features, and more.
Google has been developing its tools on this front for some time, which has now led to new advances in YouTube’s options, including the capacity to tag products displayed within video clips and provide direct shopping options, facilitating broader eCommerce opportunities in the app.
And now, Facebook too is taking the next steps, with a new process that’s much better at singling out individual objects within video frames.
“Working in collaboration with researchers at Inria, we have developed a new method, called DINO, to train Vision Transformers (ViT) with no supervision. Besides setting a new state of the art among self-supervised methods, this approach leads to a remarkable result that is unique to this combination of AI techniques. Our model can discover and segment objects in an image or a video with absolutely no supervision and without being given a segmentation-targeted objective.”
That effectively automates object segmentation, which is a major advance in computer vision technology.
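At its core, DINO trains a student network to match a momentum "teacher" (an exponential moving average of the student) across different augmented views of the same image, with the teacher's output centered and sharpened to keep training from collapsing. The following is a minimal NumPy sketch of that loss and update, not Facebook's implementation; the function names are illustrative and the temperatures are roughly the paper's defaults:

```python
import numpy as np

def softmax(x, temp):
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    # The teacher output is centered (so no single dimension dominates)
    # and sharpened with a low temperature; the student is trained to
    # match it via cross-entropy. In the real method, no gradients
    # flow through the teacher.
    t = softmax(teacher_logits - center, teacher_temp)
    log_s = np.log(softmax(student_logits, student_temp) + 1e-12)
    return -(t * log_s).sum(axis=-1).mean()

def ema_update(teacher_w, student_w, momentum=0.996):
    # Teacher weights are an exponential moving average of the
    # student's: the "self-distillation with no labels" in DINO's name.
    return momentum * teacher_w + (1 - momentum) * student_w

# Toy tensors standing in for projection-head outputs on two views.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))   # batch of 4, 8-dim output
center = logits.mean(axis=0)       # a running average in the real method
matched = dino_loss(logits, logits, center)
```

Because no labels appear anywhere in this objective, the representations it learns come entirely from the images themselves.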
And as noted, that will open up a range of new potential opportunities.
“Segmenting objects helps facilitate tasks ranging from swapping out the background of a video chat to teaching robots that navigate through a cluttered environment. It is considered one of the hardest challenges in computer vision because it requires that AI truly understand what is in an image. This is traditionally done with supervised learning and requires large volumes of annotated examples. But our work with DINO shows highly accurate segmentation may actually be solvable with nothing more than self-supervised learning and a suitable architecture.”
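One way to picture that segmentation behavior: the self-attention maps of a DINO-trained Vision Transformer tend to light up on object regions, and a rough mask can be read off by keeping the smallest set of image patches that together hold a chosen fraction of the attention mass. Here is a toy sketch of that thresholding step; the 4x4 "attention map" and the `keep_mass` value are made up for illustration:

```python
import numpy as np

def attention_mask(attn, keep_mass=0.6):
    # Keep the smallest set of patches that together hold `keep_mass`
    # of the total attention, yielding a binary object mask.
    flat = attn.ravel()
    order = np.argsort(flat)[::-1]               # patches, brightest first
    cum = np.cumsum(flat[order]) / flat.sum()    # cumulative attention mass
    k = int(np.searchsorted(cum, keep_mass)) + 1
    mask = np.zeros_like(flat, dtype=bool)
    mask[order[:k]] = True
    return mask.reshape(attn.shape)

# A synthetic 4x4 attention map with one bright "object" region.
attn = np.full((4, 4), 0.01)
attn[1:3, 1:3] = 1.0
mask = attention_mask(attn, keep_mass=0.9)   # selects only the bright patches
```

In practice the attention map would come from the model's class token rather than be hand-built, but the thresholding idea is the same.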
That could help Facebook provide new options, like YouTube’s, for tagging products for associated display within video content. As Facebook notes, there are also applications in AR and visual tools that could lead to much more advanced, more immersive Facebook functions.
And that could also incorporate further data gathering and personalization.
Back in 2017, in the early stages of its video recognition efforts, Facebook noted that advances in the tech would lead to increased capacity to showcase more relevant content to users based on their viewing habits.
“AI inference could rank video streams, personalizing the streams for individual user’s newsfeeds and removing the latency of video publishing and distribution. The personalization of real-time reality video could be very compelling, again increasing the time that users spend in the Facebook app.”
Of course, Facebook probably wouldn’t state its objectives so overtly now. But getting users to spend more time consuming content remains its aim: to provide the most compelling, valuable experience for all users, in order to maximize engagement time and boost its utility and value.
That also provides it with more advertising opportunities, and it’s easy to see how these advanced video recognition tools could be a major boon to Facebook’s advertising business. Indeed, YouTube is actually planning to tag all items in all video clips, not just those where the creator assigns a tag, in order to provide more shoppable product options across its app.
Whether YouTube takes that step or not, we’ll have to wait and see, but it is interesting to consider the broader implications of such advances, and how they could change your marketing and promotional process.
And then there’s AR. With Facebook developing its own AR glasses, it’s also feasible that this technology could be used to better identify objects in your real world view, in order to provide assistance, promotions, and other information.
There’s a wide range of potential use cases, and it’s interesting to see how Facebook’s tools are developing on this front.
You can read the full DINO research paper and insights here.