eupolicy.social is one of the many independent Mastodon servers you can use to participate in the fediverse.
This Mastodon server is a friendly and respectful discussion space for people working in areas related to EU policy. When you request to create an account, please tell us something about you.

Server stats:

215
active users

#neuralese

0 posts0 participants0 posts today
Thomas Renkert🦞<p>The <a href="https://hcommons.social/tags/OpenAI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>OpenAI</span></a> paper by Baker et al, "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation" comes to a troubling conclusion: <a href="https://hcommons.social/tags/LLM" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>LLM</span></a> s with <a href="https://hcommons.social/tags/reasoning" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>reasoning</span></a> or <a href="https://hcommons.social/tags/ChainOfThought" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>ChainOfThought</span></a> (<a href="https://hcommons.social/tags/CoT" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>CoT</span></a>) capabilities might learn to obfuscate their own CoT from human users if they are being penalized for displaying "wrong" (i.e. reward hacking or misalignment) reasoning. </p><p>As a result, OpenAI strongly advises against applying reward pressure "directly" onto the CoT of a model. </p><p>🤔 While that is certainly the right thing to do, how long will <a href="https://hcommons.social/tags/AI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>AI</span></a> take to figure out that *indirect CoT pressure* is being applied anyway and that it could circumvent these restrictions by obfuscating its own CoT? Maybe something like this will happen by accident or within an "evolutionary" self-improvement loop. Perhaps a sufficiently advanced model will realize that its own <a href="https://hcommons.social/tags/neuralese" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>neuralese</span></a> serves as <a href="https://hcommons.social/tags/steganography" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#<span>steganography</span></a> to hide its intents from humans anyway and keep its CoT in non-English?</p><p>source: <a href="https://cdn.openai.com/pdf/34f2ada6-870f-4c26-9790-fd8def56387f/CoT_Monitoring.pdf" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">cdn.openai.com/pdf/34f2ada6-87</span><span class="invisible">0f-4c26-9790-fd8def56387f/CoT_Monitoring.pdf</span></a></p>