Voice interfaces have become a new entry point that is changing the way we interact with computers. How do these systems work? What hardware does it take to build such a device? As voice control has grown in popularity, an engineer at Texas Instruments (TI) took an in-depth look at the technology and shared his knowledge and views on it.
What is a voice interface?
Speech recognition technology has been around since the 1950s, when Bell Labs engineers built a system that could recognize a single spoken digit. However, speech recognition is only one part of a complete voice interface. A voice interface covers the same ground as a traditional user interface: it presents information and gives the user a way to manipulate the device. In a voice interface, that manipulation, and even some of the information presentation, happens by voice. A voice interface can also be offered as an option alongside traditional user interface elements such as buttons or a display.
The first voice interface most people encountered was probably a mobile phone, or a very basic speech-to-text program on a personal computer. Those early systems, however, were slow, inaccurate, and limited to a small vocabulary.
What turned speech recognition from a side feature into one of the hottest technologies in computing? First, computing power and algorithm performance have improved dramatically (if you know a little about hidden Markov models, this will feel intuitive). Second, cloud computing and big-data analytics have improved speech recognition, raising both the speed and the accuracy of recognition.
Add speech recognition to your device
A common question is how to add a voice interface to a project. TI offers several products for voice interfaces, including the Sitara™ family of ARM® processors and the C5000™ DSP family, both of which have voice-processing capabilities. Each family has its own strengths and suits different applications.
When choosing between the DSP and ARM solutions, the key factor is whether the device will use a cloud voice platform. There are three application scenarios: offline, where all processing happens on the local device; online, where processing is handled by a cloud voice service such as Amazon Alexa, Google Assistant, or IBM Watson; and a hybrid of the two.
Offline: car voice control
Judging from current trends, people seem to want everything connected to the Internet. However, whether for cost reasons or because a reliable network connection is unavailable, connectivity adds little value in some applications. Many infotainment systems in modern cars use offline voice interfaces. These systems typically support only a limited command set, such as "make a call", "play music", and "raise or lower the volume". Although speech recognition algorithms on general-purpose processors have made significant progress, they still fall short in some respects. In such cases, a DSP such as the C55xx can deliver the best performance for the system.
Online: smart home hub
Much of the excitement around voice interfaces revolves around connected devices such as Google Home and Amazon's Alexa-powered Echo. Since Amazon opened its Alexa Voice Service ecosystem to third parties, its progress in this area has attracted a great deal of attention. Other cloud services, such as Microsoft Azure, also offer speech recognition and similar capabilities. Notably, all of the voice processing for these devices happens in the cloud.
Whether this convenient integration is worth sending data upstream to the voice service provider is entirely up to the user. The cloud service provider does the heavy lifting, and what the device manufacturer must do is very simple. In fact, since the speech-synthesis half of the interface also happens in the cloud, an Alexa device only needs to perform the simplest functions: record audio and play it back. Since no special signal processing is required, an ARM processor is sufficient to handle the interface. This means that if your device already has an ARM processor, you can likely integrate a cloud voice interface.
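To make this concrete, below is a minimal sketch of the "record and play back" role the ARM processor takes on. The endpoint URL, audio format, and response handling are hypothetical placeholders rather than any real Alexa Voice Service API; a production client would stream audio continuously and handle authentication.

```python
# Minimal sketch of the "record and play back" role an ARM processor plays
# for a cloud voice service. The endpoint URL and response format are
# hypothetical placeholders, not a real Alexa Voice Service API.
import requests
import sounddevice as sd
import numpy as np

SAMPLE_RATE = 16000                         # 16 kHz mono is typical for speech
CLOUD_URL = "https://example.com/voice"     # hypothetical endpoint

def capture_utterance(seconds=5):
    """Record raw PCM audio from the default microphone."""
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="int16")
    sd.wait()                               # block until recording finishes
    return audio.tobytes()

def ask_cloud(pcm_bytes):
    """Send raw audio upstream; the cloud returns synthesized speech."""
    resp = requests.post(CLOUD_URL, data=pcm_bytes,
                         headers={"Content-Type": "audio/l16; rate=16000"})
    return np.frombuffer(resp.content, dtype=np.int16)

def main():
    pcm = capture_utterance()
    reply = ask_cloud(pcm)                  # recognition and synthesis are remote
    sd.play(reply, SAMPLE_RATE)             # the device only plays the result
    sd.wait()

if __name__ == "__main__":
    main()
```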
It is also important to understand what Alexa and similar services do not provide. Alexa does not itself perform any device control or cloud integration. Many Alexa-driven "smart devices" have a cloud component, written by the device's developer, that uses Alexa's voice processing as an input to an existing cloud application. For example, if you tell Alexa you want to order a pizza, your favorite pizza shop must have written a "skill" for Alexa. The skill is code that defines what happens when you order a pizza; every time you order, Alexa calls that skill, which hooks into an online ordering system and places the order for you. Similarly, smart home device manufacturers must implement Alexa skills that define how to interact with their local devices and online services. Amazon ships many skills of its own, and together with those provided by third-party developers, an Alexa device can be very useful even if you never develop a skill yourself.
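As an illustration, here is a minimal sketch of such a skill handler, written as an AWS Lambda function in Python. The intent name (OrderPizzaIntent), the Size slot, and the place_order() hook are hypothetical stand-ins for a real ordering back end; only the general request/response envelope follows the Alexa skill format.

```python
# Minimal sketch of an Alexa "skill" handler as an AWS Lambda function.
# OrderPizzaIntent, the Size slot, and place_order() are hypothetical
# stand-ins for a real online ordering system.

def place_order(size):
    """Hypothetical hook into an existing online ordering back end."""
    print(f"Ordering a {size} pizza")       # replace with a real API call
    return f"OK, a {size} pizza is on its way."

def build_response(text):
    """Wrap plain text in the Alexa skill response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }

def lambda_handler(event, context):
    """Entry point Alexa calls every time the skill is invoked."""
    request = event["request"]
    if (request["type"] == "IntentRequest"
            and request["intent"]["name"] == "OrderPizzaIntent"):
        size = request["intent"]["slots"]["Size"]["value"]
        return build_response(place_order(size))
    return build_response("Welcome. What would you like to order?")
```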
Hybrid: connected thermostat
Sometimes we need certain basic functions of a device to keep working even without an Internet connection. For example, a thermostat that cannot adjust the temperature on its own when the connection drops would be a real problem. To avoid this, a good product designer includes some local voice processing so that the device degrades seamlessly. Achieving this requires both a DSP, such as a C55xx, for local voice processing and an ARM processor to connect the networked interface to the cloud, as sketched below.
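The sketch below shows the rough shape of that hybrid fallback logic. The recognizer functions are hypothetical stubs; the point is simply that the device checks connectivity and falls back to a small local command set when the cloud is unreachable.

```python
# Sketch of the hybrid approach: use the cloud voice service when the network
# is up, and fall back to a small local command set (e.g. running on the DSP)
# when it is not. recognize_in_cloud() and recognize_locally() are
# hypothetical stubs, not a real API.
import socket

LOCAL_COMMANDS = {"raise the temperature", "lower the temperature"}

def network_available(host="8.8.8.8", port=53, timeout=2):
    """Cheap connectivity check: try to open a TCP socket to a public DNS server."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

def recognize_in_cloud(audio):
    """Hypothetical: stream audio to a cloud voice service and return its result."""
    return "cloud result placeholder"

def recognize_locally(audio):
    """Hypothetical: limited keyword recognition running on the local device."""
    return "raise the temperature"          # placeholder result

def handle_utterance(audio):
    if network_available():
        return recognize_in_cloud(audio)    # full natural-language handling
    command = recognize_locally(audio)      # restricted offline command set
    return command if command in LOCAL_COMMANDS else None
```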
What is voice triggering?
You may have noticed that we have not yet mentioned the truly magical part of the new generation of voice assistants: they are always listening for a "trigger word". How do they track your voice from anywhere in the room, and how do they hear you while the device itself is playing audio? There is nothing really magical about it; it just takes some intelligent software. This software is independent of the cloud voice interface and can also run offline.
The easiest part of this system to understand is the "wake word". Wake-word detection is a simple local speech recognition program that continuously samples the incoming audio, looking for a single word or phrase. Since most voice services accept audio with the wake word removed, the wake word does not have to be tied to any particular voice platform. Because the requirements are relatively modest, wake-word detection can run on the ARM processor using open-source libraries such as Sphinx or KITT.AI.
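For example, a wake-word listener built on CMU Sphinx might look like the sketch below. It assumes the older pocketsphinx-python package and its LiveSpeech helper (the API differs between pocketsphinx releases), and the keyphrase and threshold are illustrative values only.

```python
# Minimal wake-word sketch using pocketsphinx keyword spotting. Assumes the
# older pocketsphinx-python package and its LiveSpeech helper; the keyphrase
# and threshold below are illustrative values only.
from pocketsphinx import LiveSpeech

speech = LiveSpeech(
    lm=False,                   # disable the full language model
    keyphrase="hey device",     # the single wake word/phrase to listen for
    kws_threshold=1e-20,        # lower = more sensitive, more false triggers
)

for phrase in speech:           # blocks, yielding each detected keyphrase
    print("Wake word detected:", phrase)
    # At this point the device would start streaming audio to the cloud
    # voice service (or to the local recognizer in an offline design).
```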
To hear your voice from anywhere in the room, the device uses a process called beamforming. The key idea is to determine where the sound is coming from by comparing arrival times at the different microphones, given the spacing between them. Once the location of the target sound is known, the device applies audio processing techniques such as spatial filtering to further reduce noise and enhance signal quality. How beamforming is implemented depends on the microphone layout: a truly non-linear microphone array (usually circular) is needed for 360-degree coverage, while a wall-mounted device needs only two microphones for 180-degree spatial discrimination.
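The simplest form of this idea is delay-and-sum beamforming, sketched below for two microphones with NumPy. This is an illustrative toy, not TI's implementation: it estimates the inter-microphone delay by cross-correlation, aligns the channels, and averages them so the target source adds up coherently while uncorrelated noise partially cancels.

```python
# Sketch of two-microphone delay-and-sum beamforming: estimate the arrival
# time difference between the microphones by cross-correlation, then align
# and sum the signals so sound from the estimated direction adds coherently.
# Simplified illustration only.
import numpy as np

def estimate_delay(mic_a, mic_b):
    """Return the shift (in samples) that aligns mic_b with mic_a."""
    corr = np.correlate(mic_a, mic_b, mode="full")
    return np.argmax(corr) - (len(mic_b) - 1)

def delay_and_sum(mic_a, mic_b):
    """Align the two channels on the dominant source and average them."""
    lag = estimate_delay(mic_a, mic_b)
    mic_b_aligned = np.roll(mic_b, lag)     # crude integer-sample alignment;
    return 0.5 * (mic_a + mic_b_aligned)    # real systems use fractional delays

# Toy example: the same "voice" arrives 5 samples later at the second mic.
rng = np.random.default_rng(0)
voice = rng.standard_normal(1000)
mic_a = voice + 0.3 * rng.standard_normal(1000)
mic_b = np.roll(voice, 5) + 0.3 * rng.standard_normal(1000)
enhanced = delay_and_sum(mic_a, mic_b)
print("estimated shift:", estimate_delay(mic_a, mic_b))
```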
The last piece of the voice assistant is acoustic echo cancellation (AEC). AEC is somewhat similar to noise-cancelling headphones, but applied in reverse. The algorithm works from the known output audio, such as the music being played: noise-cancelling headphones use this principle to cancel external noise, whereas AEC removes the effect of the output signal on what the microphone picks up. The device can thus ignore the audio it generates itself and still hear you, no matter what the speaker is playing. AEC is computationally intensive and performs best on a DSP.
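A common way to implement AEC is with an adaptive filter such as normalized LMS (NLMS), which learns the echo path from the known loudspeaker signal and subtracts the predicted echo from the microphone input. The sketch below is a simplified NumPy illustration, not a production DSP implementation.

```python
# Sketch of acoustic echo cancellation with a normalized LMS (NLMS) adaptive
# filter: the filter learns the echo path from the known loudspeaker signal
# and subtracts its echo estimate from the microphone signal.
import numpy as np

def nlms_aec(far_end, mic, taps=64, mu=0.5, eps=1e-6):
    """Remove the loudspeaker (far-end) echo from the microphone signal."""
    w = np.zeros(taps)                      # adaptive estimate of the echo path
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]       # recent loudspeaker samples
        echo_est = w @ x                    # predicted echo at the microphone
        e = mic[n] - echo_est               # residual = near-end speech + noise
        w += mu * e * x / (x @ x + eps)     # NLMS weight update
        out[n] = e
    return out

# Toy example: the "echo" is a delayed, attenuated copy of the speaker output.
rng = np.random.default_rng(1)
far_end = rng.standard_normal(8000)                        # what the speaker plays
echo = 0.6 * np.concatenate([np.zeros(10), far_end[:-10]])
near_speech = 0.1 * rng.standard_normal(8000)              # the user's voice
mic = echo + near_speech
cleaned = nlms_aec(far_end, mic)
print("power before:", np.mean(mic**2), "after:", np.mean(cleaned[1000:]**2))
```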
To implement all of the above functions, wake-word recognition, beamforming, and AEC, the ARM processor and the DSP work together: the DSP handles all of the signal processing, while the ARM processor runs the device logic and interface. The DSP sits at the front of the input data pipeline, minimizing processing latency and thereby giving a better user experience, while the ARM is free to run a high-level operating system such as Linux to control the rest of the device. All of this advanced processing happens locally; if you use a cloud service, it receives only a single audio stream containing the final processed result.
Conclusion
Voice interfaces have become enormously popular and will be part of our lives, in various forms, for a long time to come. Although there are many different ways to implement a voice interface, whatever your application requires, TI can offer an ideal choice.