In February, OpenAI released videos created by its generative artificial intelligence program Sora. The strikingly realistic content, produced from simple text prompts, is the latest breakthrough for companies demonstrating the capabilities of AI technology. It also raised concerns about generative AI's potential to enable the creation of misleading and deceptive content at massive scale.
According to new research from Drexel University, current methods for detecting manipulated digital media are not effective against AI-generated video; but a machine learning approach could be the key to unmasking these synthetic creations.
In a paper presented at the IEEE Computer Vision and Pattern Recognition Conference in June, researchers from the Multimedia and Information Security Lab (MISL) in Drexel's College of Engineering explained that while existing synthetic image detection technology has so far failed to spot AI-generated video, they have had success with a machine learning algorithm that can be trained to extract and recognize the digital "fingerprints" of many different video generators, such as Stable Video Diffusion, VideoCrafter and CogVideo.
In addition, they demonstrated that the algorithm can learn to detect new AI generators after studying just a few examples of their videos.
"It's more than a bit unnerving that this video technology could be released before there is a good system for detecting fakes created by bad actors," said Matthew Stamm, Ph.D., an associate professor in Drexel's College of Engineering and director of the MISL.
"Responsible companies will do their best to embed identifiers and watermarks, but once the technology is publicly available, people who want to use it for deception will find a way. That's why we are working to stay ahead of them by developing technology to identify synthetic videos from patterns and traits that are endemic to the media."
Deepfake detectives
Stamm's lab has been active in efforts to flag digitally manipulated images and videos for more than a decade, but the group has been particularly busy in the last year as editing technology is being used to spread political misinformation.
Until recently, these manipulations have been the product of photo and video editing programs that add, remove or shift pixels, or that slow, speed up or clip out video frames. Each of these edits leaves a unique digital breadcrumb trail, and Stamm's lab has developed a suite of tools calibrated to find and follow them.
The lab's tools use a sophisticated machine learning program called a constrained neural network. Rather than searching from the outset for specific, predetermined identifiers of manipulation, the algorithm can learn, in a way similar to the human brain, what is "normal" and what is "unusual" at the sub-pixel level of images and videos. This allows the program both to identify deepfakes from known sources and to spot those created by a previously unknown program.
The neural network is typically trained on hundreds or thousands of examples to get a very good feel for the difference between unedited media and media that has been manipulated; the telltale signs can be anything from the variation between adjacent pixels, to the spacing and ordering of frames in a video, to the size and compression of the files themselves.
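The lab's code is not part of this article, but the general shape of such a detector can be sketched. The hypothetical PyTorch snippet below shows a small patch classifier whose first convolutional layer is re-constrained after every training step (center weight fixed at -1, remaining weights summing to 1), so it behaves like a prediction-error filter that suppresses scene content and emphasizes low-level forensic traces. The architecture, layer sizes and constraint details are illustrative assumptions drawn from the general "constrained convolution" idea in media forensics, not MISLnet itself.

```python
# Hypothetical sketch of a "constrained" CNN patch classifier (not the actual MISLnet code).
# Assumption: the first conv layer is re-projected after every optimizer step so each filter's
# center weight is -1 and its remaining weights sum to 1, making it act as a prediction-error
# (residual) filter that suppresses image content and exposes low-level forensic traces.
import torch
import torch.nn as nn

class ConstrainedConv2d(nn.Conv2d):
    def constrain(self):
        with torch.no_grad():
            k = self.kernel_size[0] // 2
            w = self.weight
            w[:, :, k, k] = 0.0                      # zero the center before normalizing
            s = w.sum(dim=(2, 3), keepdim=True)
            w /= (s + 1e-8)                          # remaining weights now sum to 1
            w[:, :, k, k] = -1.0                     # fix the center weight to -1

class PatchDetector(nn.Module):
    """Classifies a small image patch as real (0) or synthetic (1)."""
    def __init__(self):
        super().__init__()
        self.constrained = ConstrainedConv2d(3, 8, kernel_size=5, padding=2, bias=False)
        self.features = nn.Sequential(
            nn.Conv2d(8, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, 2)

    def forward(self, x):
        x = self.constrained(x)
        x = self.features(x).flatten(1)
        return self.classifier(x)

# Training-loop fragment: re-apply the constraint after each gradient step.
model = PatchDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def train_step(patches, labels):
    optimizer.zero_grad()
    loss = loss_fn(model(patches), labels)
    loss.backward()
    optimizer.step()
    model.constrained.constrain()   # keep the first layer a residual filter
    return loss.item()
```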
A new challenge
"But recently we have seen text-to-video generators, like Sora, that can make some pretty impressive videos. Those pose a whole new challenge, because they have not been produced by a camera or photoshopped," Stamm said.
Last year, a campaign ad supporting Florida Gov. Ron DeSantis, which appeared to show former President Donald Trump embracing and kissing Anthony Fauci, was the first to use generative AI technology. This means the video was not edited or spliced together from other footage, but created outright by an AI program.
Stamm pointed out that if no editing has taken place, those standard clues are simply not there, and that presents a unique problem for detection.
"But with AI-generated video, there is no evidence of frame-to-frame image manipulation, so for a detection program to be effective it will need to be able to identify new traces left behind by the way generative AI programs construct their videos."
In the study, the team tested 11 publicly available synthetic image detectors. Each of these programs was highly effective, at least 90% accurate, at identifying manipulated images. But their performance dropped by 20 to 30% when faced with videos created by publicly available AI generators: Luma, VideoCrafter-v1, CogVideo and Stable Video Diffusion.
"These results clearly show that synthetic image detectors experience substantial difficulty detecting synthetic videos," they wrote. "This finding holds consistent across multiple different detector architectures, as well as when detectors are pretrained by others or retrained using our dataset."
A trusted approach
The team speculated that convolutional neural network-based detectors, like its MISLnet algorithm, could be successful against synthetic video because the program is designed to constantly shift its learning as it encounters new examples. By doing this, it's possible to recognize new forensic traces as they evolve. Over the last several years, the team has demonstrated MISLnet's acuity at spotting images that had been manipulated using new editing programs, including AI tools—so testing it against synthetic video was a natural step.
"We've used CNN algorithms to detect manipulated images and video and audio deepfakes with reliable success," said Tai D. Nguyen, a doctoral student in MISL, who was a co-author of the paper. "Due to their ability to adapt with small amounts of new information we thought they could be an effective solution for identifying AI-generated synthetic videos as well."
For the test, the group trained eight CNN detectors, including MISLnet, with the same test dataset used to train the image detectors, which included real videos and AI-generated videos produced by the four publicly available programs. Then they tested the programs against a set of videos that included a number created by generative AI programs that are not yet publicly available: Sora, Pika and VideoCrafter-v2.
By analyzing a small portion—a patch—from a single frame from each video, the CNN detectors were able to learn what a synthetic video looks like at a granular level and apply that knowledge to the new set of videos. Each program was more than 93% effective at identifying the synthetic videos, with MISLnet performing the best, at 98.3%.
The programs were slightly more effective when conducting an analysis of the entire video, by pulling out a random sampling of a few dozen patches from various frames of the video and using those as a mini training set to learn the characteristics of the new video. Using a set of 80 patches, the programs were between 95% and 98% accurate.
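As a rough illustration of this kind of patch-level, whole-video analysis, the hypothetical sketch below samples random patches from random frames and averages the per-patch scores. The patch size, sample count and the simple averaging rule are assumptions for illustration (the paper describes using the patches as a mini training set, which is a more involved procedure), and it reuses the PatchDetector sketched earlier along with OpenCV for frame reading.

```python
# Illustrative only: score a whole video by averaging patch-level predictions.
# Assumes the hypothetical PatchDetector defined above and an OpenCV-readable file.
import cv2
import numpy as np
import torch

def sample_patches(video_path, n_patches=80, patch_size=128, seed=0):
    """Randomly crop n_patches square patches from randomly chosen frames."""
    rng = np.random.default_rng(seed)
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    patches = []
    for idx in rng.integers(0, max(n_frames, 1), size=n_patches):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        h, w = frame.shape[:2]
        if h < patch_size or w < patch_size:
            continue
        y = rng.integers(0, h - patch_size + 1)
        x = rng.integers(0, w - patch_size + 1)
        patch = frame[y:y + patch_size, x:x + patch_size, ::-1]  # BGR -> RGB
        patches.append(patch.astype(np.float32) / 255.0)
    cap.release()
    return np.stack(patches) if patches else None

@torch.no_grad()
def score_video(model, video_path):
    """Return the average probability that sampled patches are synthetic."""
    patches = sample_patches(video_path)
    if patches is None:
        return None
    x = torch.from_numpy(patches).permute(0, 3, 1, 2)  # NHWC -> NCHW
    probs = torch.softmax(model(x), dim=1)[:, 1]       # P(synthetic) per patch
    return probs.mean().item()
```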
With a bit of additional training, the programs were also more than 90% accurate at identifying the program that was used to create the videos, which the team suggests is because of the unique, proprietary approach each program uses to produce a video.
"Videos are generated using a wide variety of strategies and generator architectures," the researchers wrote. "Since each technique imparts significant traces, this makes it much easier for networks to accurately discriminate between each generator."
A quick study
While the programs struggled when faced with the challenge of detecting a completely new generator without previously being exposed to at least a small amount of video from it, with a small amount of fine-tuning MISLnet could quickly learn to make the identification at 98% accuracy. This strategy, called "few-shot learning," is an important capability because new AI technology is being created every day, so detection programs must be agile enough to adapt with minimal training.
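Few-shot adaptation of a detector to a new generator can be sketched in a similarly hedged way. The snippet below fine-tunes only the classifier head of the hypothetical PatchDetector on a handful of labeled patches from the new generator; the number of examples, learning rate, epoch count and the decision to freeze the feature extractor are illustrative choices, not the paper's recipe.

```python
# Illustrative few-shot fine-tuning: adapt a pretrained patch detector to a new,
# previously unseen video generator using only a handful of labeled patches.
# Assumes the hypothetical PatchDetector from the earlier sketch.
import torch
import torch.nn as nn

def few_shot_finetune(model, patches, labels, epochs=20, lr=1e-5):
    """patches: (N, 3, H, W) tensor of a few examples from the new generator;
    labels: (N,) tensor with 1 = synthetic, 0 = real."""
    # Freeze the feature extractor and adapt only the classifier head,
    # a common few-shot strategy (an assumption here, not the paper's exact method).
    for p in model.parameters():
        p.requires_grad = False
    for p in model.classifier.parameters():
        p.requires_grad = True

    optimizer = torch.optim.Adam(model.classifier.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(patches), labels)
        loss.backward()
        optimizer.step()
    model.eval()
    return model
```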
"We've already seen AI-generated video being used to create misinformation," Stamm said. "As these programs become more ubiquitous and easier to use, we can reasonably expect to be inundated with synthetic videos. While detection programs shouldn't be the only line of defense against misinformation—information literacy efforts are key—having the technological ability to verify the authenticity of digital media is certainly an important step."