When Amazon’s first Alexa-enabled smart speaker debuted in 2014, it was something of a novelty: a voice-activated natural language processing interface that could perform a number of simple tasks.
Fast forward to today, and the Internet-connected platform has rapidly grown into its own electronic ecosystem. With tens of thousands of Alexa-enabled devices available and hundreds of millions of units sold, Alexa has become almost ubiquitous as a virtual assistant.
But while Alexa is now integrated into everything from TVs to microwaves to headphones, Amazon’s vision for ambient computing is still in its infancy. Huge strides have been made in natural language processing and other areas of artificial intelligence to serve a potential market of billions of users, but there is still a long way to go.
Ultimately, Amazon wants to make these devices capable of understanding and supporting users almost as well as a human assistant. But to do this, significant advances must be made in several areas, including contextual decision-making and reasoning.
To dig deeper into the potential of Alexa and ambient computing in general, I interviewed Alexa Senior Vice President and Chief Scientist Rohit Prasad about the future of the platform and Amazon’s goals for the increasingly intelligent virtual assistant.
Richard Yonck: Alexa is sometimes called “ambient computing”. What are some examples or use cases for ambient AI?
Rohit Prasad: Ambient computing is technology that is there when you need it and fades into the background when you don’t. It anticipates your needs and makes your life easier by always being available without being intrusive. For example, with Alexa, you can use Routines to automate your home, like turning on your lights at sunset, or you can use Alexa Guard to have Alexa proactively warn you if it detects sounds such as breaking glass or a smoke alarm.
Yonck: During your recent CogX presentation, you mentioned that Alexa is “getting into reasoning and autonomy on your behalf.” What are some examples of this in the near future compared to where we are now?
Prasad: Today we have features like Hunches, with Alexa suggesting actions to take in response to anomalous sensor data, ranging from alerting you that the garage door is open when you go to bed to conveniently reordering when your printer ink is low. More recently, Ring Video Doorbell Pro owners can choose to have Alexa act on their behalf to greet visitors and offer to take a message or provide instructions for package delivery.
Overall, we’ve made progress toward more contextual decision-making and taken initial steps in reasoning and autonomy through self-learning, or Alexa’s ability to improve and expand its capabilities without human intervention. Last year we took another step forward with a new Alexa feature that can infer a customer’s latent goal. Suppose a customer asks for the weather at the beach: Alexa can use the request, in combination with other contextual information, to infer that the customer may be interested in a trip to the beach.
Yonck: Edge computing is a way to do some of the computation on or near the device rather than in the cloud. Do you think enough of Alexa’s processing can eventually be done at the edge to sufficiently reduce latency, support federated learning, and address privacy concerns?
Prasad: From the moment we introduced Echo and Alexa in 2014, our approach has combined cloud, on-device and edge processing. The relationship is symbiotic. Where the computation takes place will depend on several factors, including connectivity, latency, and customer privacy.
As an example, we understood that customers would want basic functionality to work even if they lose network connectivity. As a result, in 2018, we launched a hybrid mode where smart home intents, including controlling lights and switches, would continue to work even if connectivity was lost. This also applies to using Alexa on the go, including in the car, where connectivity can be intermittent.
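As a rough illustration of the hybrid mode Prasad describes, the routing decision might look like the sketch below. Everything in it is an assumption made for illustration: the intent names, the `LOCAL_INTENTS` set, and the `route` function are hypothetical and do not reflect Amazon's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical set of smart home intents the device can handle by itself.
LOCAL_INTENTS = {"turn_on_light", "turn_off_light", "toggle_switch"}


@dataclass
class Result:
    handled_by: str  # "cloud", "device", or "none"
    message: str


def route(intent: str, cloud_available: bool) -> Result:
    """Prefer the cloud; fall back to the device for local-capable intents."""
    if cloud_available:
        return Result("cloud", f"cloud handled {intent}")
    if intent in LOCAL_INTENTS:
        return Result("device", f"device handled {intent}")
    return Result("none", f"{intent} needs connectivity")


if __name__ == "__main__":
    print(route("turn_on_light", cloud_available=False))
    print(route("play_music", cloud_available=False))
```

The point of the sketch is only that a small, fixed set of requests keeps working offline, while everything else still depends on the cloud.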
In recent years, we have applied various techniques to make neural networks efficient enough to run on the device, minimizing the memory and compute footprint without losing accuracy. Now, with neural accelerators like our AZ1 Neural Edge processor, we’re pioneering new customer experiences like natural turn-taking, a feature we’ll be bringing to customers this year that uses on-device algorithms to fuse acoustic and visual cues to infer whether the participants in a conversation are interacting with each other or with Alexa.
Yonck: In your AI pillars for the future, you described several capabilities that we need in our social bots and task bots. Can you share timelines for any of them, even if they are only rough?
Prasad: Open-domain, multi-turn conversation remains an unsolved problem. However, I am excited to see university students advancing conversational AI through the Alexa Prize competition tracks. Participating teams have improved the state of the art by developing better natural language understanding and dialogue policies, leading to more engaging conversations. Some have even worked on recognizing humor and generating humorous responses or selecting contextually relevant jokes.
These are tough AI problems that will take time to solve. While I think we are 5-10 years from achieving the goals of these challenges, one area that I’m particularly passionate about in conversational AI is where the Alexa team recently received a best paper award: infusing commonsense knowledge graphs explicitly and implicitly into large pre-trained language models to give machines greater intelligence. Such work will make Alexa more intuitive and intelligent for our customers.
Yonck: For open domain conversations, you mentioned combining transformer-based neural response generators with knowledge selection to generate more engaging responses. Very briefly, how is the selection of knowledge carried out?
Prasad: We’re pushing the boundaries of open-domain conversation, including through the Alexa Prize SocialBot Challenge, where we’re continually inventing on behalf of the participating university teams. One of these innovations is a neural transformer-based language generator (i.e., the Neural Response Generator, or NRG). We have extended NRG to generate even better responses by incorporating a dialogue policy and fusing in world knowledge. The policy determines the optimal form of the response: for example, where appropriate, the AI’s next turn should acknowledge the previous turn and then ask a question. To integrate knowledge, we index publicly available knowledge on the web and retrieve the sentences most relevant to the dialogue context. NRG’s goal is to produce optimal responses that conform to the policy decision and include the knowledge.
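The retrieval step Prasad mentions, picking the sentences most relevant to the dialogue context, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Alexa team's method: it scores candidates with plain bag-of-words cosine similarity, whereas a production system would more likely use a learned retriever, and the names `tokenize`, `cosine`, and `select_knowledge` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_knowledge(dialogue_context: list[str],
                     sentences: list[str],
                     top_k: int = 2) -> list[str]:
    """Return up to top_k indexed sentences that overlap most with the context."""
    context_vec = Counter(tokenize(" ".join(dialogue_context)))
    scored = [(cosine(context_vec, Counter(tokenize(s))), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for score, s in scored[:top_k] if score > 0]


if __name__ == "__main__":
    context = ["I love hiking in the mountains",
               "Which mountains are good for hiking?"]
    index = ["The Alps are a popular destination for hiking in the mountains.",
             "Bananas are rich in potassium.",
             "Jazz originated in New Orleans."]
    print(select_knowledge(context, index, top_k=1))
```

In a full pipeline, the selected sentences would be passed to the response generator along with the dialogue history and the policy's chosen response form.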
Yonck: For more natural interaction, you ideally want a large contextual basis for conversations. That means learning, storing and having access to a huge amount of personal information and preferences in order to provide each user with uniquely personalized responses. This seems very compute- and storage-intensive. Where is Amazon’s hardware now versus where it needs to be to finally get there?
Prasad: This is where edge processing comes in. To provide the best customer experience, some processing, such as computer vision to determine who in the room is talking to the device, must be done locally. This is an active area of research and invention, and our teams are working diligently to make machine learning (both inference and model updates) more efficient on the device. In particular, I am excited about large, pre-trained, deep learning-based models that can be efficiently distilled for efficient processing at the edge.
Yonck: What do you think is the biggest challenge in achieving a fully developed ambient AI, as you described?
Prasad: The biggest challenge in realizing our vision is to move from reactive responses to proactive assistance, where Alexa is able to detect anomalies and alert you (for example, a hunch that you left the garage door open) or anticipate your needs to achieve your latent goals. While AIs can be pre-programmed for such proactive support, this won’t be scalable given the myriad of use cases.
Therefore, we need to move towards a more general intelligence, that is, the ability of an AI to: 1) multitask without requiring significant task-specific intelligence, 2) adapt to variability within a set of known tasks, and 3) learn completely new tasks.
In the context of Alexa, this means becoming more self-learning, without requiring human supervision; more self-service, by making it easier to integrate Alexa into new devices, drastically reducing the burden on developers to create conversational experiences, and even allowing customers to personalize Alexa and directly teach it new concepts and personal preferences; and more aware of the ambient state, to proactively anticipate customer needs and seamlessly support them.