We wanted to test Microsoft Cognitive Services for use in an innovative project. These services, built on artificial intelligence algorithms, can recognize audio, faces and, in some cases, the voice of the speaker. In this post we share some of the code we used to recognize speech recorded through a web page using a simple <video> tag. Let’s see how.
The start page has a very simple layout made of two video tags that serve two purposes. The first records the speech; the second lets us review the recorded video and then either delete it or send it to the Microsoft services to be processed.
Let’s start with the code contained in the jQuery function. After setting up the command buttons, we define the parameters to pass to the video device objects in the DOM. Then we use the read-only Navigator.mediaDevices property. This property returns a MediaDevices object, which provides access to connected media input devices such as cameras and microphones, as well as screen sharing. Warning: the behavior differs between localhost and production. On localhost the property is always populated and accessible; in production the page must be served over HTTPS, otherwise, for security reasons, it will not work. You will know everything is fine when the browser shows its permission prompt asking to use the camera and microphone.
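As a minimal sketch of this step, the snippet below defines a set of capture parameters and checks that mediaDevices is available before asking for the stream. The constraint values and the helper names (`canCaptureMedia`, `requestStream`) are our own assumptions for illustration, not the code from the original page.

```javascript
// Capture parameters; the exact values are assumptions for illustration.
const constraints = {
  audio: true,
  video: { width: { ideal: 640 }, height: { ideal: 480 } },
};

// Returns true when the page can ask for camera/microphone access.
// `nav` is a Navigator-like object, passed in so the check also runs outside a browser.
function canCaptureMedia(nav) {
  return Boolean(
    nav && nav.mediaDevices && typeof nav.mediaDevices.getUserMedia === "function"
  );
}

// Requests the stream, failing early with a readable message when the
// page is not in a secure context (HTTPS or localhost).
async function requestStream(nav, captureConstraints = constraints) {
  if (!canCaptureMedia(nav)) {
    throw new Error(
      "navigator.mediaDevices is unavailable; serve the page over HTTPS or from localhost"
    );
  }
  return nav.mediaDevices.getUserMedia(captureConstraints);
}
```

In the browser this would be called as `requestStream(navigator)`, and the permission prompt appears when the returned promise is pending.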
After calling getUserMedia with the capture parameters, the callback receives an object that represents the media stream coming from the camera and microphone. This stream becomes the value of the srcObject property of our video element. The srcObject property of the media element interface sets or returns the object that serves as the source of the media associated with the element, a video in our case. The stream is recorded into an array declared in the variable chunks. Let me show how this video is saved.
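A hedged sketch of the recording step follows, assuming the browser's MediaRecorder API is what fills the chunks array. The function name `startRecording` and the two element parameters are hypothetical; the recorder constructor is injectable only so the logic can be exercised outside a browser.

```javascript
// Shows `stream` live in `liveVideo`, records it, and loads the result
// into `previewVideo` when recording stops.
// `RecorderCtor` defaults to the browser's MediaRecorder.
function startRecording(stream, liveVideo, previewVideo, RecorderCtor = globalThis.MediaRecorder) {
  const chunks = [];
  liveVideo.srcObject = stream;

  const recorder = new RecorderCtor(stream);

  // Each chunk the recorder makes available is appended to the array.
  recorder.ondataavailable = (event) => {
    if (event.data && event.data.size > 0) chunks.push(event.data);
  };

  // When recording ends, the chunks become one Blob for the second video tag.
  // The type is only a label; it does not convert the recorded data.
  recorder.onstop = () => {
    const blob = new Blob(chunks, { type: "video/mp4" });
    previewVideo.src = URL.createObjectURL(blob);
  };

  recorder.start();
  return { recorder, chunks };
}
```

Calling `recorder.stop()` from the stop button then triggers the `onstop` handler and makes the preview playable.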
Whenever the stream makes a chunk available, it is stored in the array. When recording ends, the blob is created and handed to the second video tag so that, once we have seen the result, we can decide whether to use it or delete it. When we are happy with the video, we send it to the server, which calls the Cognitive Services and returns what they have produced. Let’s see how to send an mp4 video.
To send a video, the simplest approach is to build a Blob object from the raw data stored in the chunks array, indicating the type of file we are sending, in our case 'video/mp4'. Then, using a FormData object, we can simulate the presence of a form on the page to collect all the input data to be sent to the server. Here we add the newly created blob under the name the server expects. We set the headers and wait for the response of a fetch call; once it completes, we deserialize the JSON response to see what the Cognitive Services have understood. As a final touch, in addition to inserting the returned text into the page, we use the browser's speechSynthesis object to read the result aloud.
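The upload described above could look roughly like the sketch below. The endpoint path `/api/speech`, the form field name `video` and the helper names are hypothetical placeholders; the real values depend on what the server expects.

```javascript
// Uploads the recorded chunks and returns the parsed JSON reply.
// "/api/speech" and the "video" field name are hypothetical.
async function sendRecording(chunks, { url = "/api/speech", fetchImpl = globalThis.fetch } = {}) {
  const blob = new Blob(chunks, { type: "video/mp4" });
  const form = new FormData();
  form.append("video", blob, "recording.mp4");

  // Content-Type is left to fetch, which adds the multipart boundary itself.
  const response = await fetchImpl(url, {
    method: "POST",
    headers: { Accept: "application/json" },
    body: form,
  });
  if (!response.ok) {
    throw new Error(`Upload failed with status ${response.status}`);
  }
  return response.json();
}

// Writes the recognized text into the page and reads it aloud
// when the browser offers speech synthesis.
function presentResult(text, outputElement, synth = globalThis.speechSynthesis) {
  outputElement.textContent = text;
  if (synth && typeof SpeechSynthesisUtterance !== "undefined") {
    const utterance = new SpeechSynthesisUtterance(text);
    utterance.lang = "it-IT"; // assumption: matches the Italian recognition language
    synth.speak(utterance);
  }
}
```

Leaving the Content-Type header unset is deliberate: setting it by hand to multipart/form-data drops the boundary and the server can no longer parse the body.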
The Server Side (ASP.NET Core, C#)
On the server we first write the uploaded file back out in its .mp4 format and then extract its audio track into a .wav file. To do this, you can directly use the .NET version of the FFmpeg libraries (free and completely cross-platform) or rely on third-party libraries that also support .NET Core. At this point we have the wav file to submit to Microsoft Cognitive Services. The official documentation shows two options: either use the server-side objects built around the SpeechRecognizer class, or call the REST API through an HttpClient, which provides the same type of information. It goes without saying that you need an Azure subscription with the Speech Recognition service activated; for a limited volume of requests it stays available at very low cost. We chose the second option. Let’s look in particular at how the HTTP request is built.
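To keep the examples in one language, the sketch below shows the shape of that request in JavaScript rather than C#; the same URL and headers would be set on an HttpClient. The endpoint path, the `format=detailed` option and the header names are taken from our reading of the public speech-to-text REST documentation for short audio and should be checked against the current docs. The function name, the region value and the 16 kHz PCM content type are assumptions.

```javascript
// Builds the URL and request options for the short-audio speech-to-text endpoint.
// `region` and `subscriptionKey` come from your own Azure Speech resource.
function buildSpeechRequest({ region, subscriptionKey, language = "it-IT", wavBytes }) {
  const query = new URLSearchParams({ language, format: "detailed" });
  const url =
    `https://${region}.stt.speech.microsoft.com` +
    `/speech/recognition/conversation/cognitiveservices/v1?${query}`;

  return {
    url,
    options: {
      method: "POST",
      headers: {
        "Ocp-Apim-Subscription-Key": subscriptionKey,
        // Assumption: the wav file was extracted as 16 kHz PCM.
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        Accept: "application/json",
      },
      body: wavBytes,
    },
  };
}
```

The result can be passed straight to fetch, for example `fetch(request.url, request.options)`, with the subscription key kept on the server and never shipped to the browser.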
After sending the wav stream, with the headers properly filled in with the Subscription Key obtained when the service was activated on Azure, the server replies with JSON in the format that we illustrate below:
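As a rough guide, a reply in the detailed format has approximately the shape shown in this sketch. The field names follow our understanding of the public REST documentation; every value is made up for illustration and is not real service output. The helper `bestTranscript` is hypothetical and simply picks the candidate with the highest confidence.

```javascript
// Illustrative reply; the values are invented, not real service output.
const sampleReply = {
  RecognitionStatus: "Success",
  Offset: 1200000,
  Duration: 21500000,
  NBest: [
    {
      Confidence: 0.93,
      Lexical: "ciao alexia",
      ITN: "ciao alexia",
      MaskedITN: "ciao alexia",
      Display: "Ciao Alexia.",
    },
  ],
};

// Returns the display text and confidence of the best candidate,
// or null when nothing was recognized.
function bestTranscript(reply) {
  const candidates = reply && reply.NBest;
  if (reply.RecognitionStatus !== "Success" || !Array.isArray(candidates) || candidates.length === 0) {
    return null;
  }
  const best = candidates.reduce((a, b) => (b.Confidence > a.Confidence ? b : a));
  return { text: best.Display, confidence: best.Confidence };
}
```

Checking the status before reading the candidates matters because a reply can be well-formed JSON and still contain no recognized speech.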
It is interesting to note, from an Artificial Intelligence point of view, that the answer comes with the degree of confidence with which our speech was recognized. Note also that the endpoint takes a querystring parameter indicating that the language to be recognized is Italian (?language=it ….)
You can see the finesse with which the Cognitive Services correctly identified the name Alexia (with the X), while the browser’s text-to-speech service still gets a little confused about the correct accent. Of course, this was only a demonstration of how effective and precise the AI services provided by Microsoft are. If you try them, you will notice that the confidence of the answer is always very high. We found the same in the facial recognition services. In our view, Microsoft has done a really great job.