#neuralese

Thomas Renkert🦞The <a href="https://hcommons.social/tags/OpenAI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#OpenAI</a> paper by Baker et al, "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation" comes to a troubling conclusion: <a href="https://hcommons.social/tags/LLM" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#LLM</a> s with <a href="https://hcommons.social/tags/reasoning" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#reasoning</a> or <a href="https://hcommons.social/tags/ChainOfThought" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#ChainOfThought</a> (<a href="https://hcommons.social/tags/CoT" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#CoT</a>) capabilities might learn to obfuscate their own CoT from human users if they are being penalized for displaying "wrong" (i.e. reward hacking or misalignment) reasoning. As a result, OpenAI strongly advises against applying reward pressure "directly" onto the CoT of a model. 🤔 While that is certainly the right thing to do, how long will <a href="https://hcommons.social/tags/AI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#AI</a> take to figure out that *indirect CoT pressure* is being applied anyway and that it could circumvent these restrictions by obfuscating its own CoT? Maybe something like this will happen by accident or within an "evolutionary" self-improvement loop. Perhaps a sufficiently advanced model will realize that its own <a href="https://hcommons.social/tags/neuralese" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#neuralese</a> serves as <a href="https://hcommons.social/tags/steganography" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#steganography</a> to hide its intents from humans anyway and keep its CoT in non-English?source: <a href="https://cdn.openai.com/pdf/34f2ada6-870f-4c26-9790-fd8def56387f/CoT_Monitoring.pdf" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://cdn.openai.com/pdf/34f2ada6-870f-4c26-9790-fd8def56387f/CoT_Monitoring.pdf</a>

Recent searches

Search options

Administered by:

Server stats: