About the role:
Anthropic’s AI technology is among the most capable and safe in the world. However, large language models are a new type of intelligence, and the art of instructing and evaluating them in a way that delivers the best results is still in its infancy; it is a hybrid of research, engineering, and behavioral science. We’re bringing rigor to this new discipline by applying a range of techniques to systematically discover and document prompting best practices, using our models to improve training and evaluation, developing prompt self-improvement techniques that automatically optimize the model’s performance on any given task, and finding ways to make it easy for our customers to do the same.
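To give a concrete flavor of what “prompt self-improvement” can look like, here is a minimal sketch of a greedy prompt-optimization loop built on the Anthropic Python SDK. It is an illustration only, not the Metaprompter or any actual Anthropic tooling; the model name, the string-match grader, and the helper functions are all placeholder assumptions:

```python
# Minimal, illustrative prompt self-improvement loop. Not the Metaprompter
# or any internal Anthropic tooling; the model name and the string-match
# grader below are placeholder assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-sonnet-latest"  # placeholder model name

def run_prompt(system_prompt: str, task_input: str) -> str:
    """Run the candidate system prompt on one task input."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": task_input}],
    )
    return response.content[0].text

def score(output: str, reference: str) -> float:
    """Toy grader: substring match. A real system might use a model judge."""
    return float(reference.lower() in output.lower())

def improve(system_prompt: str, failures: list[str]) -> str:
    """Ask the model to rewrite the prompt, given inputs it failed on."""
    rewrite = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the system prompt below so that it handles the "
                "failing inputs correctly. Return only the new prompt.\n\n"
                f"PROMPT:\n{system_prompt}\n\nFAILING INPUTS:\n"
                + "\n".join(failures)
            ),
        }],
    )
    return rewrite.content[0].text

def optimize(prompt: str, dataset: list[tuple[str, str]], rounds: int = 3) -> str:
    """Greedy hill-climb: keep a rewritten prompt only if it scores higher."""
    best = prompt
    best_score = sum(score(run_prompt(best, x), y) for x, y in dataset)
    for _ in range(rounds):
        failures = [x for x, y in dataset if score(run_prompt(best, x), y) == 0.0]
        if not failures:
            break
        candidate = improve(best, failures)
        cand_score = sum(score(run_prompt(candidate, x), y) for x, y in dataset)
        if cand_score > best_score:
            best, best_score = candidate, cand_score
    return best
```

A production version of this loop might generate many candidate rewrites per round, hold out evaluation data to avoid overfitting the prompt, and replace the substring grader with a model-based judge.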
Given that this is a nascent field, we ask that you include in your application a specific project you’re proud of: a prompting architecture, a model evaluation, a synthetic data generation pipeline, a model finetune, or an application built on LLMs. Ideally this project should show off a complex and clever prompting architecture, a systematic evaluation of an LLM’s behavior in response to different prompts, or an example of using LLMs for a relevant ML task such as careful dataset curation and processing. There is no preferred task; we just want to see how you create and experiment with prompts. You can also include a short description of your process or any roadblocks you hit and how you dealt with them, but this is not a requirement.
Responsibilities:
- Develop automated prompting techniques for our models (e.g., extensions to the Metaprompter)
- Finetune new capabilities into Claude that maximize its performance or ease of use given particular prompting innovations
- Lead automated evaluation of Claude models and prompts across the training and product lifecycle
- Help create and optimize data mixes for model training
- Develop and systematically test new and creative prompting strategies for a wide range of research tasks relevant to our finetuning and product efforts
- Help create and maintain the infrastructure required for efficient prompt iteration and testing
- Develop future Anthropic products built on top of Claude
- Stay up-to-date with the latest research in prompting and model orchestration, and share knowledge with the team
You may be a good fit if you:
- Have significant ML research or software engineering experience
- Have at least a high-level familiarity with the architecture and operation of large language models
- Have extensive prior experience exploring and testing language model behavior
- Have spent time prompting and/or building products with language models
- Have good communication skills and an interest in working with other researchers on difficult prompting tasks
- Have a passion for making powerful technology safe and societally beneficial
- Stay informed by taking an active interest in emerging research and industry trends
- Enjoy pair programming (we love to pair!)
- Have an advanced degree in computer science, mathematics, statistics, physics, or a related technical field, or an advanced degree in a relevant non-technical field alongside evidence of programming experience
Strong candidates may also have experience with:
- Large-scale model training and evaluation
- Language modeling with transformers
- Reinforcement learning
- Large-scale ETL
Representative projects:
- Building the prompting and model orchestration for a production application backed by a language model
- Finetuning Claude to maximize its performance when a particular prompting technique is used
- Building and testing an automatic prompt optimizer or an automatic LLM-driven evaluation system for judging a prompt’s performance on a task
- Implementing a novel retrieval, tool use, sub-agent, or memory architecture for language models
- Building a scaled model evaluation framework driven by model-based evaluation techniques (see the sketch after this list)
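As an illustration of the last two projects, here is a minimal sketch of a model-graded (“LLM-as-judge”) evaluation harness built on the Anthropic Python SDK. This is not Anthropic’s internal evaluation framework; the model name, rubric, and toy test cases are placeholder assumptions:

```python
# Minimal, illustrative LLM-as-judge harness for scoring a prompt on a task.
# Not Anthropic's internal evaluation framework; the model name, rubric,
# and test cases below are placeholder assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-sonnet-latest"  # placeholder model name

JUDGE_RUBRIC = (
    "You are grading a model answer. Reply with exactly PASS if the answer "
    "is correct, helpful, and faithful to the question, and FAIL otherwise.\n\n"
    "Question: {question}\nAnswer: {answer}"
)

def complete(system_prompt: str, question: str) -> str:
    """Answer one test question using the prompt under evaluation."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=512,
        system=system_prompt,
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

def judge(question: str, answer: str) -> bool:
    """Model-based grading: a second model call decides pass or fail."""
    verdict = client.messages.create(
        model=MODEL,
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": JUDGE_RUBRIC.format(question=question, answer=answer),
        }],
    )
    return verdict.content[0].text.strip().upper().startswith("PASS")

def evaluate(system_prompt: str, questions: list[str]) -> float:
    """Return the judged pass rate of the prompt over the test questions."""
    passes = sum(judge(q, complete(system_prompt, q)) for q in questions)
    return passes / len(questions)

if __name__ == "__main__":
    cases = ["What is 17 * 23?", "What is the capital of Australia?"]  # toy cases
    print(evaluate("You are a careful assistant. Think step by step.", cases))
```

At scale, a harness like this might use separate prompter and judge models, structured per-task rubrics, and repeated sampling to control for judge variance.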
Deadline to apply: None. Applications will be reviewed on a rolling basis.