We attack the state-of-the-art Go-playing AI system, KataGo, by training an adversarial policy that plays against a frozen KataGo victim. Our attack achieves a >99% win-rate against KataGo without search, and a >50% win-rate when KataGo uses enough search to be near-superhuman. To the best of our knowledge, this is the first successful end-to-end attack against a Go AI playing at the level of a top human professional. Notably, the adversary does not win by learning to play Go better than KataGo — in fact, the adversary is easily beaten by human amateurs. Instead, the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary. Our results demonstrate that even professional-level AI systems may harbor surprising failure modes. See this https URL for example games.
9:00 Sprint Review
Used the LMN tools to figure out what to emphasize and find more papers
Figure out some keywords for various groups and start pulling tweets. I think 10k per group a week would be manageable.
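A minimal sketch of the pulling plan above: bucket incoming tweet texts by keyword group, with the 10k-per-group weekly cap from the note. The group names and keywords are placeholders, not the real groups (those are still TBD).

```python
from collections import defaultdict

# Hypothetical keyword groups -- the real groups are still to be decided.
KEYWORD_GROUPS = {
    "ai_safety": ["alignment", "adversarial", "robustness"],
    "llm": ["gpt", "prompt", "language model"],
}

WEEKLY_CAP = 10_000  # per-group weekly budget from the note

def bucket_tweets(tweets, groups=KEYWORD_GROUPS, cap=WEEKLY_CAP):
    """Assign each tweet text to every matching keyword group, capped per group."""
    buckets = defaultdict(list)
    for text in tweets:
        lower = text.lower()
        for group, keywords in groups.items():
            if len(buckets[group]) < cap and any(k in lower for k in keywords):
                buckets[group].append(text)
    return dict(buckets)
```

Substring matching is crude (no stemming, no phrase handling), but it is enough to estimate weekly volume per group before committing to a data source.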
Watching Twitter implode. Maybe I should just use the pushshift API?
By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the “program,” optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 19/24 tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts. Please check out our webpage at this https URL.
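The search loop the abstract describes can be sketched in a few lines. This is my own paraphrase, not the paper's code: `propose_fn` stands in for the proposal LLM and `score_fn` for zero-shot evaluation of the executor LLM, both hypothetical callables.

```python
def ape_select(propose_fn, score_fn, task_examples, n_candidates=8):
    """APE-style loop: propose instruction candidates with one LLM, score each
    by another LLM's zero-shot performance, and keep the best candidate."""
    candidates = [propose_fn(task_examples) for _ in range(n_candidates)]
    # score_fn(instruction, examples) -> e.g. fraction of examples the
    # executor LLM answers correctly when conditioned on the instruction
    scored = [(score_fn(c, task_examples), c) for c in candidates]
    best_score, best_instruction = max(scored)
    return best_instruction, best_score
```

The "program synthesis" framing is just this: treat the instruction as the program, the score function as the spec, and search over LLM-proposed candidates.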
One of the things to add as a suggestion is a model-training facility with dedicated staff. The facility exists to train models up to very large sizes that are resilient to attack (think of a GPT-3 ensemble), and is staffed with people who study how models fail. The facility also trains faulty models (mode collapse, overfitting, etc.) that can be invisibly swapped in for verified (whatever that means) models so that AI pilots can learn to recognize degraded model behavior. It also needs lots of simulators that allow users to train in high-stress situations and adapt to failing models.
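The invisible-swap drill could be as simple as a registry wrapper. A toy sketch (all names hypothetical): the trainee always calls `predict`, and an instructor-controlled rate decides whether the verified or the deliberately degraded model answers.

```python
import random

class ModelRegistry:
    """Toy registry for the drill idea above: serve a verified model most of
    the time, but invisibly swap in a degraded one so trainees learn to
    recognize degraded behavior from outputs alone."""

    def __init__(self, verified, degraded, swap_rate=0.1, seed=None):
        self.verified = verified        # callable: input -> prediction
        self.degraded = degraded        # deliberately faulty counterpart
        self.swap_rate = swap_rate      # fraction of calls served degraded
        self._rng = random.Random(seed)
        self.last_was_degraded = False  # instructors can audit; trainees cannot

    def predict(self, x):
        self.last_was_degraded = self._rng.random() < self.swap_rate
        model = self.degraded if self.last_was_degraded else self.verified
        return model(x)
```

The audit flag matters for scoring the drill: the simulator can grade whether the trainee flagged the degraded sessions without ever exposing the swap to them.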
Since the facility trains many models, it will be possible to train meta models that can understand which hyperparameters and data sets produce effective models, and how to degrade them. This will be extremely valuable as AI/ML continue to move into more roles that were previously occupied by highly trained and/or experienced people.
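A first cut at such a meta-model need not be fancy. A sketch, assuming the facility logs each run as a (hyperparameters, quality score) record: a nearest-neighbour lookup that estimates what score a new setting would get.

```python
def predict_quality(history, query, k=3):
    """Nearest-neighbour meta-model sketch: given (hyperparams, score) records
    from past training runs, estimate the score a new hyperparameter setting
    would earn by averaging its k closest recorded neighbours.

    Assumes every record's hyperparameter dict has the same numeric keys."""
    def dist(a, b):
        return sum((a[key] - b[key]) ** 2 for key in a) ** 0.5

    nearest = sorted(history, key=lambda rec: dist(rec[0], query))[:k]
    return sum(score for _, score in nearest) / len(nearest)
```

Even this crude estimator answers both questions in the note: which settings produce effective models (high predicted score) and which reliably degrade them (low predicted score).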
Find the chess paper that shows AI/human teams out-perform AI-only.