July 4th, 2017
In Dario's view, the research that's most valuable from an AI safety perspective also has substantial value from the perspective of solving problems today, and can be productively worked on in the same manner as any other area of machine learning (ML) research. He started by explaining the situation around goals, as an illustration of the kind of work he'd like to see and where it fits in.
To be intelligent, you need to do roughly three things: make predictions about your environment, take actions based on those predictions, and have and execute complex goals. Historically, most research effort has gone into the first two. This makes sense: there are many difficult problems where we can specify the goals very simply (recognize handwriting, win at Go, classify the subjects of images), and we've made lots of progress this way.
On the other hand, there are also lots of cases where it's hard to specify goals. Maybe we know what would be a good solution but don't know how to code that as a reward function. Right now this is mostly a limitation on our ability to apply learning systems to problems, but if the prediction and action aspects of ML get far enough ahead of the reward aspect, it could be dangerous. Much of the risk of things going wrong, from Dario's perspective, is that if specifying complex goals is still pretty new to us when we get to AGI, we might not have enough experience to get it right. Instead, he would like to see us prioritize goal work now, and help the reward side keep up with the rest of ML.
Not surprisingly, given how pressing he sees this as, that's one of the things his group is working on. For example, in their recent paper, Deep Reinforcement Learning from Human Preferences (blog post, pdf), they train systems for several tasks by asking humans to compare pairs of short video clips and pick which one is better. Instead of asking people for feedback constantly, they train a model to predict the human's judgements, which means they can have the system ask for feedback just in situations where it's most uncertain. This is only a step, but it's an example of the kind of work he thinks we need more of.
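To make the idea concrete, here is a toy sketch in Python of learning a reward from pairwise comparisons and then choosing which pair to ask about next. It is an illustration under my own assumptions, not the paper's implementation: the linear reward over per-step features, the simulated human (`TRUE_WEIGHTS`, `simulated_human`), and the "closest to a coin flip" uncertainty rule are all hypothetical stand-ins, and the real system uses neural networks on video clips.

```python
import math
import random

def clip_features(clip):
    """Sum each feature over the steps of a clip (a list of feature vectors)."""
    return [sum(column) for column in zip(*clip)]

def prefer_prob(weights, clip_a, clip_b):
    """Modelled probability that the human prefers clip_a to clip_b."""
    fa, fb = clip_features(clip_a), clip_features(clip_b)
    diff = sum(w * (a - b) for w, a, b in zip(weights, fa, fb))
    diff = max(-30.0, min(30.0, diff))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-diff))

def train(comparisons, n_features, lr=0.1, epochs=100):
    """Fit reward weights by gradient ascent on the comparison log-likelihood."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for clip_a, clip_b, label in comparisons:
            p = prefer_prob(weights, clip_a, clip_b)
            fa, fb = clip_features(clip_a), clip_features(clip_b)
            for i in range(n_features):
                weights[i] += lr * (label - p) * (fa[i] - fb[i])
    return weights

def most_uncertain(weights, candidate_pairs):
    """Pick the pair whose predicted preference is closest to a coin flip."""
    return min(candidate_pairs,
               key=lambda pair: abs(prefer_prob(weights, pair[0], pair[1]) - 0.5))

# Hypothetical stand-in for the human: likes feature 0, dislikes feature 1.
TRUE_WEIGHTS = [1.0, -1.0]

def simulated_human(clip_a, clip_b):
    """Return 1 if the simulated human prefers clip_a, else 0."""
    return 1 if prefer_prob(TRUE_WEIGHTS, clip_a, clip_b) > 0.5 else 0

def random_clip(rng, steps=3, n_features=2):
    """A made-up 'clip': a few steps, each a small feature vector."""
    return [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(steps)]

rng = random.Random(0)
pairs = [(random_clip(rng), random_clip(rng)) for _ in range(200)]
comparisons = [(a, b, simulated_human(a, b)) for a, b in pairs]
learned = train(comparisons, n_features=2)
print("learned reward weights:", learned)
```

In a loop like the one described above, you would call `most_uncertain(learned, new_pairs)` to decide which comparison to show a person next, add their answer to `comparisons`, and retrain. A more faithful version would likely estimate uncertainty differently (for example, from disagreement among several predictors), so treat the selection rule here as a placeholder.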
Two other examples he gave of the kind of areas he'd like to see more work in were transparency (understanding how the system gives the answers it does; example) and adversarial examples (inputs chosen to make the system screw up; example).
He also wanted to emphasize that he thinks AI safety work today should aim to be valuable on its own as ML research that allows us to perform new tasks, and would be valuable even if we didn't consider long-term safety. The idea is that this helps ground the work, gives it an empirical feedback loop, and makes it more likely to be useful in the long run.
At this point I was wondering: since industry also cares a lot about whether learning systems do what we want them to do, is this really something we need to do for altruistic reasons? Dario's response was that transparency and safety are difficult research areas and, while they do pay off in the short run, they pay off more in the long run, so they will tend to be underinvested in. There are also many more promising research directions than researchers right now, so what ends up getting explored is highly dependent on what researchers are interested in.
Additionally, the sooner you think we might have AGI, the more important this work is. If it takes us a long time to get AGI, perhaps because we need a lot more computational power or many new learning breakthroughs, there's more likely to have been enough time to make more balanced progress. If it takes a short time, however, perhaps because it turns out that all we need for AGI is extensions of what we have today and a lot of scaling, then the risk is much higher.
This lines up strongly with my 2015 thoughts on why I didn't think (mathematical-style) AI safety work was likely to be productive.