As the hype around generative AI continues to build, the need for robust safety regulations is only becoming more clear.
Now Anthropic—the company behind Claude AI—is looking at how its models could deceive or sabotage users. Anthropic just dropped a paper laying out their approach.
Anthropic’s latest research, titled "Sabotage Evaluations for Frontier Models," comes from its Alignment Science team, driven by the company's "Responsible Scaling" policy.
The goal is to gauge just how capable AI might be at misleading users or even "subverting the systems we put in place to oversee them." The study focuses on four specific tactics: Human Decision Sabotage, Code Sabotage, Sandbagging, and Undermining Oversight.
Think of users who push ChatGPT to the limit, trying to coax it into generating inappropriate content or graphic images. These tests are all about ensuring that the AI can’t be tricked into breaking its own rules.
In the paper, Anthropic says its objective is to be ready for the possibility that AI could evolve into something with dangerous capabilities. So the company put its Claude 3 Opus and Claude 3.5 Sonnet models through a series of tests designed to evaluate and enhance their safety protocols.
The first test, Human Decision Sabotage, examined how AI could potentially manipulate human decision-making. The second test, Code Sabotage, analyzed whether AI could subtly introduce bugs into codebases. Stronger AI models actually led to stronger defenses against these kinds of vulnerabilities.
The remaining tests — Sandbagging and Undermining Oversight — explored whether the AI could conceal its true capabilities or bypass safety mechanisms embedded within the system.
For now, Anthropic’s research concludes that current AI models pose a low risk, at least in terms of these malicious capabilities.
"Minimal mitigations are currently sufficient to address sabotage risks," the team writes, but "more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve."
Translation: watch out, world.