Speech Recognition by Humanoid Robot in Real Environment

- Voice Instruction Picked-Up out of Real Life Noises -

(Translation of AIST press release on June 2, 2005)

Key points

A target speaker is tracked in real time using multiple microphones and a camera installed in the robot's head.
The robot's sense of hearing is implemented by extracting only the voice signal from a noisy environment dominated by a TV or other noise sources, based on target-speaker tracking and sound separation.
The work is expected to open the way to natural communication between human beings and humanoid robots in the living environment.

Synopsis

The National Institute of Advanced Industrial Science and Technology (AIST), an independent administrative institution, has developed a speech recognition function for real environments using a microphone array, successfully extending the sensing capability of the humanoid robot under the Humanoid Robotic Project HRP-2 "Prométhée". The microphone array consists of eight omnidirectional microphones mounted around the robot's head (Fig. 1 left). The sound source is located from the differences in arrival time at the individual microphones, while a camera mounted on the robot's head detects, tracks and locates the person giving the vocal instruction. Stable speech recognition is obtained by combining the information from the microphone array and the camera and by isolating and eliminating noise. Hardware that eliminates noise in real time has been developed and built into the robot, so that a human operator can give the robot vocal instructions, and control IT appliances through the robot, even in a setting with multiple noise sources such as a TV.

It is therefore expected that natural communication between a human operator and a humanoid robot can be realized in the living environment through the robot's auditory function.

The present study has been carried out as part of the AIST project "Development of Humanoid Robot Type Intelligence Booster Platform" (fiscal years 2003-05).


Fig. 1. (Left) The head of a humanoid robot equipped with a microphone array. Arrows show the positions of the mounted microphones. (Right) Multi-channel signal processing hardware built into the robot.

Research Background

Since Honda Motor Co., Ltd. announced the humanoid robot P2 in 1996, R&D on humanoid robots has intensified not only in Japan but around the world. The technological strategic map for robotics drafted by the Ministry of Economy, Trade and Industry (METI) plans for robots that support human labor in the living environment to be in practical use by 2025, for tasks such as household work, self-reliance support, and assistance and nursing care for the aged.

Previous R&D on humanoid robot technology has focused on locomotion, aiming at safe and stable walking and behavior, and on robot vision. Little full-scale technological development has been done on the robot's hearing function, which plays an important role in establishing natural communication between humans and robots.

In the living environment, where practical use of next-generation robots is expected, direct human-robot interaction through the voice channel is becoming one of the key perceptual functions of a robot.

History of R&D Work

AIST is working to realize an advanced and safe IT society in which everyone can freely create, distribute and share the data and knowledge they need, using an increasingly sophisticated IT environment. Developing human interface technology for natural communication between machines and humans is one of its priority themes, and the humanoid robot is one form of that technology, able to work safely alongside human beings in a variety of real environments, including household spaces. Against this background, AIST embarked in fiscal 2003 on a new project, "Development of Humanoid Robot Type Intelligence Booster Platform", to develop humanoid robots capable of working safely and securely in diverse environments and of communicating naturally with human beings. The present study has been carried out as part of this project.

Details of R&D Works

The living environment in which next-generation robots are expected to work in the near future contains many sound sources, such as a TV. In such conditions, natural communication between humans and robots through the voice channel, just as in human-to-human interaction, is an essential function for robots. The present study has made it possible to install a voice interface on a humanoid robot that operates in an environment with many sound sources. The humanoid robot "HRP-2 Prométhée" was used in this work.

The voice interface developed in this study consists of the following components:

A microphone array system consisting of 8 omnidirectional microphones embedded around the head of the HRP-2.
Software to identify the position of a human being in an image taken by a wide-field camera mounted on the head of the HRP-2.
Software to determine the position of a sound source from the differences in arrival time of the voice signal at each microphone in the array, to detect utterance segments, and to isolate sound sources by combining this with the visual information on human position supplied by the camera, separating and eliminating noise other than human voices.
Small-sized hardware for multi-channel signal processing to execute these software functions in real time (Fig. 1 Right).
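
The localization step in the list above can be illustrated with a minimal sketch for a single pair of microphones. It is not the AIST implementation: the function names, the far-field assumption, the brute-force cross-correlation, the speed of sound, and the tolerance-based comparison with the camera bearing are all assumptions chosen for illustration. The release does not state how arrival-time differences are computed or how the eight channels are combined.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value, not from the release


def estimate_delay(sig_a, sig_b, max_lag):
    """Return the lag in samples by which sig_b trails sig_a.

    Brute-force cross-correlation over lags in [-max_lag, max_lag];
    the lag with the highest correlation score wins.
    """
    n = len(sig_a)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += sig_a[i] * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bearing_from_delay(lag_samples, sample_rate, mic_spacing):
    """Convert an inter-microphone delay into a far-field bearing in degrees.

    0 degrees is broadside (perpendicular to the line joining the two
    microphones). The sine is clamped so measurement noise cannot push
    it outside [-1, 1].
    """
    delay_s = lag_samples / sample_rate
    sine = delay_s * SPEED_OF_SOUND / mic_spacing
    sine = max(-1.0, min(1.0, sine))
    return math.degrees(math.asin(sine))


def matches_camera(audio_bearing, camera_bearing, tolerance_deg):
    """Hypothetical fusion rule: accept a sound as the tracked person's
    voice only if it arrives from roughly where the camera sees them."""
    return abs(audio_bearing - camera_bearing) <= tolerance_deg
```

With a hypothetical 16 kHz sample rate and 0.2 m microphone spacing, a call such as `bearing_from_delay(estimate_delay(a, b, 8), 16000, 0.2)` gives one bearing estimate per microphone pair. A real array would combine several pairs to resolve the direction unambiguously.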

The voice interface removes noise from the human voice and feeds the result into the speech recognition software "Julian". This allows the humanoid robot to recognize voice instructions stably in an environment containing a TV and other noise sources, without the human operator having to wear a headset, and so establishes the robot's hearing function.
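
One common way to favor sound from a known direction before recognition is delay-and-sum beamforming, sketched below. The release does not say which separation method the AIST system uses, so this is a stand-in technique rather than a description of it. The function name and the integer-sample delays are assumptions, and the call into the recognizer is deliberately left out.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel on the target and average them.

    channels: list of equal-length sample lists, one per microphone.
    delays:   delays[m] is how many samples late the target's sound
              arrives at microphone m (integers, assumed known from a
              prior localization step).

    Sound from the target direction adds up coherently, while sound
    from other directions is misaligned and is attenuated by averaging.
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, delay in zip(channels, delays):
        for i in range(n):
            j = i + delay  # read ahead to undo this channel's lateness
            if 0 <= j < n:
                out[i] += samples[j]
    return [value / len(channels) for value in out]
```

With two microphones, a target pulse aligned by the given delays keeps its full amplitude, while an interfering pulse that reaches both microphones at the same time is spread across two positions at half amplitude. More microphones strengthen this effect, which is one motivation for an eight-microphone array.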

In addition, software has been developed that lets the robot act on the recognized vocal instructions and operate a TV and other IT appliances over the network, demonstrating the usefulness of the voice interface.


National Institute of Advanced Industrial Science and Technology (AIST)