Complete internet-based communication has long been the domain of large-scale enterprises. By complete internet-based communication I mean communication through video, audio, text or email over a single platform. The reason is that each of the communication channels we have encountered so far has required an entirely different technology, while video and audio communication has often demanded dedicated plug-ins.
Such is not the case anymore, thanks to the latest innovation in web technology: the WebRTC API.
In our previous post we introduced WebRTC and talked about why plug-ins in general are a bad idea. We also developed an understanding of why WebRTC is going to be the future of internet-based communication. Extending that same sentiment, in this post I am going to talk about how you can build your first HTML WebRTC application that supports both audio and video communication.
Before we begin, the first thing to understand is that WebRTC is a web browser API, and for it to be functional your browser must support WebRTC. The latest versions of Mozilla Firefox and Google Chrome support WebRTC, so you should start your coding on them.
Also be aware that some application platforms claim to be ‘WebRTC enabled’ when they actually only support getUserMedia and none of the other RTC components, so it is vital to check the platform carefully before development.
A technical introduction to WebRTC
A WebRTC application needs to do several things:
- Get streaming audio, video or other data.
- Get network information such as IP addresses and ports, and exchange this with other WebRTC clients (known as peers) to enable connection, even through NATs and firewalls.
- Coordinate signalling communication to report errors and initiate or close sessions.
- Exchange information about media and client capability, such as resolution and codecs.
- Communicate streaming audio, video or data.
To implement these functions, WebRTC employs three main APIs:
- MediaStream (aka getUserMedia)
- RTCPeerConnection
- RTCDataChannel
MediaStream: getUserMedia gets access to data streams, such as from the user’s camera and microphone. getUserMedia is available in Chrome, Opera and Firefox.
RTCPeerConnection: audio or video calling, with facilities for encryption and bandwidth management. It is supported in Chrome (on desktop and for Android), Opera (on desktop and in the latest Android Beta) and in Firefox. A word of explanation about the name: after several iterations, RTCPeerConnection is currently implemented by Chrome and Opera as webkitRTCPeerConnection and by Firefox as mozRTCPeerConnection. Other names and implementations have been deprecated. When the standards process has stabilized, the prefixes will be removed. There’s an ultra-simple demo of Chromium’s RTCPeerConnection implementation at simpl.info/pc and a great video chat application at apprtc.appspot.com.
This app uses adapter.js, a JavaScript shim maintained by Google that abstracts away browser differences and spec changes.
RTCDataChannel: peer-to-peer communication of generic data. The API is supported by Chrome 25, Opera 18 and Firefox 22 and above.
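Given the prefixed and partial implementations described above, it is worth feature-detecting all three APIs before relying on them. Here is a minimal sketch; the prefixed names are the ones mentioned above and will disappear once the standards process stabilizes:

```javascript
// Minimal feature detection for the three WebRTC APIs,
// accounting for the current vendor prefixes.
navigator.getUserMedia = navigator.getUserMedia ||
                         navigator.webkitGetUserMedia ||
                         navigator.mozGetUserMedia;

var RTCPeerConnection = window.RTCPeerConnection ||
                        window.webkitRTCPeerConnection ||
                        window.mozRTCPeerConnection;

if (!navigator.getUserMedia) {
  console.log('getUserMedia is not supported in this browser');
}
if (!RTCPeerConnection) {
  console.log('RTCPeerConnection is not supported in this browser');
}
// RTCDataChannel support is exposed via createDataChannel()
// on an RTCPeerConnection instance, so it is checked there.
```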
MediaStream (aka getUserMedia)
The MediaStream API represents synchronized streams of media. For example, a stream taken from camera and microphone input has synchronized video and audio tracks. (Don’t confuse MediaStream tracks with the &lt;track&gt; element, which is something entirely different.)
Each MediaStream has an input, which might be a MediaStream generated by navigator.getUserMedia(), and an output, which might be passed to a video element or an RTCPeerConnection.
The getUserMedia() method takes three parameters (see the sketch after this list):
- A constraints object.
- A success callback which, if called, is passed a MediaStream.
- A failure callback which, if called, is passed an error object.
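Putting the three parameters together, here is a minimal sketch that requests a camera stream and attaches it to a video element. The element selector and the use of window.URL.createObjectURL() are illustrative assumptions; Firefox, for instance, has used mozSrcObject instead, which adapter.js papers over:

```javascript
var constraints = { video: true, audio: false };

function successCallback(stream) {
  // Attach the stream to a <video> element (the id is hypothetical).
  var video = document.querySelector('#myVideo');
  video.src = window.URL.createObjectURL(stream);
  video.play();
}

function errorCallback(error) {
  console.log('getUserMedia error: ', error);
}

navigator.getUserMedia(constraints, successCallback, errorCallback);
```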
Each MediaStream has a label, such as ‘Xk7EuLhsuHKbnjLWkW4yYGNJJ8ONsgwHBvLQ’. An array of MediaStreamTracks is returned by the getAudioTracks() and getVideoTracks() methods.
For the simpl.info/gum example, stream.getAudioTracks() returns an empty array (because there’s no audio) and, assuming a working webcam is connected, stream.getVideoTracks() returns an array of one MediaStreamTrack representing the stream from the webcam. Each MediaStreamTrack has a kind (‘video’ or ‘audio’), and a label (something like ‘FaceTime HD Camera (Built-in)’), and represents one or more channels of either audio or video. In this case, there is only one video track and no audio, but it is easy to imagine use cases where there are more: for example, a chat application that gets streams from the front camera, rear camera, microphone, and a ‘screenshared’ application.
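A short sketch of those accessors, assuming a video-only stream like the simpl.info/gum example:

```javascript
function gotStream(stream) {
  console.log('Stream label: ' + stream.label);
  console.log('Audio tracks: ' + stream.getAudioTracks().length); // 0 here
  var videoTracks = stream.getVideoTracks();
  if (videoTracks.length > 0) {
    console.log('Kind: ' + videoTracks[0].kind);   // 'video'
    console.log('Label: ' + videoTracks[0].label); // e.g. 'FaceTime HD Camera (Built-in)'
  }
}

navigator.getUserMedia({ video: true }, gotStream, function (error) {
  console.log('getUserMedia error: ', error);
});
```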
Do note: getUserMedia can also be used as an input node for the Web Audio API:
```javascript
function gotStream(stream) {
  window.AudioContext = window.AudioContext || window.webkitAudioContext;
  var audioContext = new AudioContext();

  // Create an AudioNode from the stream
  var mediaStreamSource = audioContext.createMediaStreamSource(stream);

  // Connect it to destination to hear yourself
  // or any other node for processing!
  mediaStreamSource.connect(audioContext.destination);
}

navigator.getUserMedia({audio: true}, gotStream);
```
Chromium-based apps and extensions can also incorporate getUserMedia. Adding audioCapture and/or videoCapture permissions to the manifest enables permission to be requested and granted only once, on installation. Thereafter the user is not asked for permission for camera or microphone access.
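A minimal manifest sketch with those permissions might look like the following; every field except the two permission strings is a placeholder:

```json
{
  "name": "My WebRTC App",
  "version": "1.0",
  "manifest_version": 2,
  "permissions": [
    "audioCapture",
    "videoCapture"
  ]
}
```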
Likewise on pages using HTTPS: permission only has to be granted once for getUserMedia() (in Chrome at least). First time around, an Always Allow button is displayed in the browser’s infobar.
The intention is eventually to enable a MediaStream for any streaming data source, not just a camera or microphone. This would enable streaming from disc, or from arbitrary data sources such as sensors or other inputs.
Note that getUserMedia() must be run from a web server rather than the local file system, otherwise a PERMISSION_DENIED: 1 error will be thrown.
Signaling: session control, network and media information
WebRTC uses RTCPeerConnection to communicate streaming data between browsers (aka peers), but also needs a mechanism to coordinate communication and to send control messages, a process known as signaling. Signaling methods and protocols are not specified by WebRTC: signaling is not part of the RTCPeerConnection API.
Instead, WebRTC app developers can choose whatever messaging protocol they prefer, such as SIP or XMPP, and any appropriate duplex (two-way) communication channel. The apprtc.appspot.com example uses XHR and the Channel API as the signaling mechanism. The code-lab we built uses Socket.io running on a Node server.
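To give a sense of how little server code signaling can require, here is a minimal sketch of a Socket.io relay on Node; the port and event name are assumptions, and a real application would also handle rooms and errors:

```javascript
// server.js — a minimal signaling relay (a sketch, not production code).
// Assumes Node.js with the socket.io package installed.
var io = require('socket.io').listen(8080);

io.sockets.on('connection', function (socket) {
  // Relay each signaling message (SDP offers/answers, ICE candidates)
  // from one peer to all other connected peers.
  socket.on('message', function (message) {
    socket.broadcast.emit('message', message);
  });
});
```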
RTCPeerConnection
RTCPeerConnection is the WebRTC component that handles stable and efficient communication of streaming data between peers.
Below is a WebRTC architecture diagram showing the role of RTCPeerConnection. As you will notice, the green parts are complex!
WebRTC architecture diagram (from webrtc.org)
From a JavaScript perspective, the main thing to understand from this diagram is that RTCPeerConnection shields web developers from the myriad complexities that lurk beneath. The codecs and protocols used by WebRTC do a huge amount of work to make real-time communication possible, even over unreliable networks:
- packet loss concealment
- echo cancellation
- bandwidth adaptivity
- dynamic jitter buffering
- automatic gain control
- noise reduction and suppression
- image ‘cleaning’.
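Before walking through the full applications, here is a simplified sketch of the offer/answer exchange from a signaling perspective, loosely modeled on the example in the W3C specification. The signalingChannel object, its send() method and logError() are hypothetical stand-ins for whatever signaling mechanism the app actually uses:

```javascript
// Hypothetical signaling channel: replace send() with your own
// mechanism (Socket.io, XHR + Channel API, etc.).
var signalingChannel = {
  send: function (message) { /* deliver message to the remote peer */ }
};

function logError(error) {
  console.log(error);
}

// Server configuration (STUN/TURN) is covered at the end of this post.
var pc = new webkitRTCPeerConnection(null);

// Trickle ICE: forward each network candidate as it is gathered.
pc.onicecandidate = function (event) {
  if (event.candidate) {
    signalingChannel.send(JSON.stringify({ candidate: event.candidate }));
  }
};

// Caller: create an offer describing local media, and send it.
pc.createOffer(function (offer) {
  pc.setLocalDescription(offer, function () {
    signalingChannel.send(JSON.stringify({ sdp: offer }));
  }, logError);
}, logError);

// Callee: on receiving the offer, apply it and reply with an answer.
function onOfferReceived(message) {
  pc.setRemoteDescription(new RTCSessionDescription(message.sdp), function () {
    pc.createAnswer(function (answer) {
      pc.setLocalDescription(answer, function () {
        signalingChannel.send(JSON.stringify({ sdp: answer }));
      }, logError);
    }, logError);
  }, logError);
}
```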
The simplified code above shows WebRTC from a signaling perspective. Below are walkthroughs of two working WebRTC applications: the first is a simple example to demonstrate RTCPeerConnection; the second is a fully operational video chat client.
In the real world, WebRTC needs servers, however simple, so the following can happen:
- Users discover each other and exchange ‘real world’ details such as names.
- WebRTC client applications (peers) exchange network information.
- Peers exchange data about media such as video format and resolution.
- WebRTC client applications traverse NAT gateways and firewalls.
In other words, WebRTC needs four types of server-side functionality:
1. User discovery and communication.
2. Signaling.
3. NAT/firewall traversal.
4. Relay servers in case peer-to-peer communication fails.
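Items 3 and 4 are typically provided by STUN and TURN servers, which are handed to RTCPeerConnection in its configuration. Here is a minimal sketch, using Google’s public STUN server and a hypothetical TURN relay; the legacy ‘url’ key matches the prefixed implementations discussed earlier, and the exact TURN credential format varies between versions:

```javascript
var configuration = {
  iceServers: [
    // STUN: lets a peer discover its public address (NAT/firewall traversal).
    { url: 'stun:stun.l.google.com:19302' },
    // TURN: relays media when a direct peer-to-peer connection fails.
    // The hostname, username and credential here are hypothetical.
    { url: 'turn:user@turn.example.org:3478', credential: 'secret' }
  ]
};
var pc = new webkitRTCPeerConnection(configuration);
```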