The Valve Index Ear Speakers have been optimized for the specific experiential goals of virtual reality, and this caused their design to diverge in interesting ways from that of typical consumer headphones.
Early in our VR experimentation, it became clear that helping a VR user achieve an appropriate suspension of disbelief* required not only a reliance on the narrative, environmental, and emotional methods of traditional games and films, but also an entirely new category of physiological problem solving unique to VR. When we don a headset to play Budget Cuts, we expect VR to make us feel like our body has been transported into an office full of murderous robots rather than simply showing us their environment through a static screen.
Our research and playtesting led us to understand that achieving maximum sonic immersion imposed as many requirements on the design of the audio components as it did on the 3D tracking system or on the display panels. We also learned that designing around those requirements meant accepting some interesting tradeoffs, which affected things like the position of the speakers, the weight of the driver, the shape of the driver’s diaphragm, the industrial design of the speaker enclosure, and even the fundamental frequency-response characteristics.
*We have called this embodied suspension of disbelief ‘presence’ in other contexts, but that term carries other connotations in the audio world so we won’t use it in the remainder of this Deep Dive article.
Both Hardware and Software
Convincing audio immersion can only be achieved by relying on both software and hardware domains simultaneously. Knowing where to draw the line between the responsibilities of hardware (audio devices) and software (games, VR experiences) required a holistic consideration of the entire VR audio pipeline - from how VR sound content is created, to how it is output by game engines, to all the ways it can reach the ear.
On the software side, game audio engineers and scientists have been working toward creating convincing immersive sound content since the first player-relative panning experiences emerged in the 90’s (Doom, Half-Life, Aureal3D etc). Then, thanks to current generation VR, we’ve seen huge improvements in spatial audio technologies. Binaural rendering and physics-based sound simulation plugins such as SteamAudio allow developers to author even greater sonic positional accuracy, physically accurate virtual reverbs, sound occlusion, and propagation, all through regular stereo headphones. When considering the optimal listening device for current VR, we leveraged the following knowledge and research in audio software simulations:
- VR content is mostly delivered in stereo - one left and one right audio channel. These channels can contain embedded binaural and HRTF tonal coloration relative to where the player is looking at any given point in time.
- Our outer ears, head shape, and facial geometry add a specific tonal signature that helps our brains identify real sound vs imagined sound, as well as the location of sound sources relative to us (behind, above, below, left, right etc.).
- Mid-high frequency sound fidelity is very important.
- Binaural simulations rely on subtle changes in tonal coloration (1kHz-8kHz) to convey the position of a sound source relative to the player. If a listening device adds its own muffled frequency coloration, this will interfere with the player’s ability to localize sound.
- Humans in general are very sensitive to sounds within the 2kHz-5kHz range. If the frequency content of a virtual sound doesn’t match what we expect it to be in reality, then we are more likely to identify the sound as “not real”. This is particularly noticeable in how easy it is to tell a voice broadcast through a speaker from someone talking beside you.
- Low frequency sound fidelity is important.
- While low frequency content doesn’t occur in nature too often, it absolutely appears regularly in VR and entertainment content (music, rumbles, explosions, gunfire, heartbeats, impacts, magic spells, etc.). Bass is critical to convey a sense of size and scale. It augments the visual immersion of VR and elicits certain emotional cues - danger, awe, isolation, internalization, etc. Therefore, it was important that our listening device maintained a healthy amount of bass response.
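To make the point about device coloration in the 1kHz-8kHz range concrete, here is a minimal sketch that scores how far a device's frequency response strays from flat inside that band. The function name, the data layout, and both example curves are assumptions made up for illustration; they are not measurements of any real device.

```python
# Hypothetical helper: none of these names come from the article.
LOCALIZATION_BAND_HZ = (1_000.0, 8_000.0)  # range cited above for binaural cues


def band_coloration_db(response, band=LOCALIZATION_BAND_HZ):
    """Return (mean, max) absolute deviation from flat, in dB, inside `band`.

    `response` is a list of (frequency_hz, gain_db) points, where 0 dB means
    the device reproduces that frequency without boosting or cutting it.
    """
    low, high = band
    gains = [gain for freq, gain in response if low <= freq <= high]
    if not gains:
        raise ValueError("no measurement points fall inside the band")
    mean_dev = sum(abs(g) for g in gains) / len(gains)
    return mean_dev, max(abs(g) for g in gains)


# Made-up example curves, for illustration only.
neutral = [(500, 0.5), (1000, 0.0), (2000, -0.5), (4000, 0.5), (8000, 0.0)]
muffled = [(500, 1.0), (1000, 0.0), (2000, -3.0), (4000, -6.0), (8000, -9.0)]

print(band_coloration_db(neutral))
print(band_coloration_db(muffled))
```

A device whose score is large in this band is more likely to mask the small tonal cues that binaural rendering depends on. A single number like this is only a rough screening aid and says nothing about how the coloration is perceived.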
Why not headphones?
Traditional headphones excel at delivering direct, player-relative, stereo sound content straight to each ear. Players can look in any direction in the virtual world, and 3D game engines with sound simulation plugins will output the required stereo signal to convey the correct location of the virtual sound source. This is the reason competitive e-sport players (eg. CS:GO) use headphones instead of front facing speakers - headphones provide more direct spatial sound information. Two output channels (L/R), for two earpieces (L/R), for two ears (L/R) - straightforward.
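The simplest form of player-relative stereo output is amplitude panning, where a source's direction relative to the listener sets the left and right channel gains. The sketch below uses the standard constant-power pan law. The function name and the angle convention are assumptions, and real binaural or HRTF rendering does far more than this (per-ear filtering and time delays), so treat it as an illustration of "two channels for two ears" rather than of any particular engine or plugin.

```python
import math


def constant_power_pan(azimuth_deg):
    """Return (left_gain, right_gain) for a source at `azimuth_deg`.

    -90 is hard left, 0 is straight ahead, +90 is hard right (an assumed
    convention). Gains follow the constant-power law, so
    left**2 + right**2 == 1 at every angle.
    """
    clamped = max(-90.0, min(90.0, azimuth_deg))
    # Map [-90, +90] degrees onto [0, pi/2] radians.
    theta = (clamped + 90.0) / 180.0 * (math.pi / 2.0)
    return math.cos(theta), math.sin(theta)


for angle in (-90, -45, 0, 45, 90):
    left, right = constant_power_pan(angle)
    print(f"{angle:>4} deg -> L={left:.3f} R={right:.3f}")
```

As the player turns their head, the engine recomputes the source's angle relative to the head and updates the gains, which is why the result only makes sense when each channel reaches its own ear directly.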
Generally speaking, however, traditional audio devices are rarely designed with sonic immersion as the primary goal. Personal devices such as ear buds, on-ear, and over-ear headphones are optimized for listening to music and entertainment in places where loudspeakers aren’t appropriate and often power requirements are extremely low (e.g. cell phones, battery powered devices). The focus is often on sound isolation, power efficiency, noise cancelling, and exaggerated frequency responses. We felt that many of these optimizations might not make as much sense in the context of current room-scale VR, where the general listening environment is a dedicated volume of space (e.g. an indoor room with light background ambience), where a tiny amount of sound leakage may be fine. We have access to plenty of power, and frequency responses need to support the assumptions of binaural sound simulations.
Headphones and earbuds need to make contact with or surround the ear in order to achieve their goals optimally. We saw that this can sometimes work against audio immersion in the following ways:
- Delivering sound directly into the ear canal bypasses the natural listening process caused by ear and head interacting with real sound waves. Listeners miss out on the tonal sound signature created by the ears, head and personal geometry. This can result in sound appearing as though it’s imagined, or it’s coming from inside one’s head, even if the audio content is highly spatial and physically simulated. We predict that eventually software simulations will take this into account.
- Pressure on the ear can become painful and uncomfortable over longer sessions, drawing people out of the VR experience.
- Some playtesters reported that the very act of headphones touching the ear signaled to them that any forthcoming sound wasn’t going to be real.
- Sealing the ear with over-ear headphones can trap heat, making the user feel hotter in the headset than they would in real life and reducing immersion.
- Tonal sound quality of some headphones can interfere with the subtle frequency colorations of binaural simulations. For example, headphones where mid-high frequencies are either exaggerated or muffled will most likely interfere with the subtleties of HRTF filters, resulting in a poor sense of directional sound in games and VR.
Why not loudspeakers?
We also considered consumer loudspeakers and beam-forming speakers in typical stereo or surround sound setups. Loudspeakers avoid many of the comfort issues associated with headphones and emit sound we can easily perceive as external to our own heads, but they did pose several obstacles to adoption:
- Existing stereo loudspeaker configurations assume a front-facing orientation, so sound is played back as if one is in an audience, listening to a band on a stage, or watching TV from a couch. This is OK for music and film on a screen. However, VR and stereo game content is output assuming L/R channels are arriving immediately to each side of the listener's head.
- Common 5.1 and 7.1 surround sound systems restrict playback to a horizontal field, whereas VR and game sound content can be virtually positioned anywhere around the listener.
- Loudspeaker systems can take time and space for the user to set up correctly, creating additional friction for VR setup.
- Loudspeakers require the player to remain within a small “sweet spot” for accurate spatial playback. VR can sometimes require people to move around in a large space.
- Loudspeakers can be influenced by the acoustics of the real room, which may conflict with the desired acoustics of the virtual world.
- Loudspeakers may make a sound feel too far away, contradicting the location of a virtual sound source that might be very close to the player's ears.
In reviewing all the tradeoffs above, it became apparent that the optimal solution for VR might be a pair of ultra near-field, full range, off-ear (extra-aural) headphones. Close enough to the ear to mimic player-relative stereo headphones and support the output format of current VR content, but far enough away to allow the ears and head to imprint their own tonal coloration onto the sound, while also addressing comfort and pressure issues. It was this realization, combined with inspiration from a childhood memory of being completely sonically immersed while lying between two inward facing hi-fi speakers, that resulted in the first prototypes being created.
The first prototype was made by taping two small full-range desktop speaker drivers to the side of a skateboard helmet. An old Vive was strapped around the outside of the helmet. The speakers were powered by USB and audio output via the headphone jack on the HTC Vive. This crude prototype did a surprisingly good job at demonstrating the increase in sonic immersion and externalization when we allowed our own ears and head to interpret sound naturally. The feeling of immersion is hard to measure quantitatively, so at this stage we relied on qualitative feedback from colleagues and playtesters to describe the sonic difference between this prototype and a pair of KOSS Porta Pro on-ear headphones while in VR. The responses were significantly enough in favor of the speakers that we felt comfortable proceeding with this design. However, several issues arose:
- Very limited bass response.
- Slight variations in speaker position, caused by putting on the helmet differently or by moving around in VR, caused the volume, frequency response and sound balance to shift significantly.
- Weight and size. The speakers were too heavy (70g each), which was at odds with the greater product goal of making our headset light and comfortable. This was probably the biggest concern early on.
- Sound leakage.
To address weight concerns we investigated using headphone drivers instead of speaker drivers. While lighter, and more power efficient, they couldn’t deliver enough volume when held away from the ear in free air. Even though we already knew this would be the case, it was interesting to hear the tradeoffs between sound immersion vs. distance away from the ear vs. frequency response and volume.
We wanted to know just how big headphone drivers needed to be in order to start meeting our volume and frequency response requirements in our extra-aural context. To help us find out, we talked with Audeze, who developed a pair of planar magnetic extra-aural headphones. The result sounded incredible; however, the weight, size, and cost were not feasible for the production goals of the Valve Index.
We returned to using speaker drivers as a basis for our design moving forward. One of the benefits of early audio R&D at this stage was being able to work independently of the rest of the Valve Index Headset system. With the help of a mechanical engineer we created a stand-alone extra-aural headphone form factor. In this new context, we were able to quickly iterate on bass response, tuning, orientation to the ear, distance from the ear, and A/B test speaker driver evaluations. This prototype was the first 3D printed ear-speaker headphones. Internally we call them “Hummingbirds.”
These colorful Hummingbirds were created with the goal of evaluating different kinds of small full-range speaker drivers. Up until this point, we had been repurposing entire consumer speaker and headphone systems. Purchasing and evaluating off-the-shelf parts required us to start building the basics of the audio subsystem: amps, audio chips, DSP (digital signal processing), and microphones. In parallel, we were getting closer to defining our shipping targets for optimal distance from the ear, rotation, weight, speaker dimensions, and frequency response.
We came across BMR (Balanced Mode Radiator) speakers during our driver evaluation stage and immediately noticed several benefits: they reduced coloration due to speaker mispositioning, were almost within range of our weight target, had great frequency response in high-mid ranges (important for binaural simulations), and were much thinner than traditional speaker drivers. We began working with Tectonic to design a custom driver for use as an off-ear speaker.
Internally, concerns were increasing about how much sound ear speakers would leak into the environment, as well as how much sound they might let in. To get a sense of how impactful this might be to customers, we built 20+ Hummingbirds and lent them out for colleagues to test at home. No one wanted to return their Hummingbirds (Chet). This was a good sign, along with the overwhelmingly positive playtest feedback. Playtesters commented that the benefits of not having anything touching the ear and the increased sense of sound immersion were offsetting issues caused by external sound coming in and/or internal sound leaking out. We decided to proceed with this design but keep these concerns in mind.
We now had a working ear speaker subsystem that was playtesting well and was within the ballpark of our fidelity, cost, and design targets. We began the process of merging the ear speaker design with the Valve Index Headset. Here it became important to start acoustically measuring the performance of our audio subsystem in the context of the headset. Taking accurate measurements allowed us to capture incremental improvements as well as identify issues in the audio subsystem. Initially we used “Mr. HATS”, our dummy head model, to take frequency response measurements of our ear speakers. Blue tape on the face marks the exact placement of the HMD on the model so early measurements could remain consistent.
To maximize the sound quality, the frequency response and bass extension were measured and refined on a daily basis. While we at Valve were working to improve bass extension through DSP using EQ tuning and algorithms like psychoacoustic bass, Tectonic was working to improve the bass mechanically by optimizing the speaker driver itself. The combination of these efforts allowed us to achieve and exceed our sound quality and bass response goals.
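The article does not describe how its psychoacoustic bass algorithm works. A common idea behind such algorithms is the "missing fundamental" effect: when a small driver cannot reproduce a low tone, playing that tone's harmonics can still lead the listener to perceive the low pitch. The sketch below only lists which harmonics of a bass note would land in a driver's usable range. The function name, the cutoff frequency, and the harmonic limit are all assumptions for illustration.

```python
def audible_harmonics(fundamental_hz, driver_cutoff_hz, max_harmonic=6):
    """List the harmonics of `fundamental_hz` that a small driver can play.

    A harmonic is kept when it sits at or above `driver_cutoff_hz`, the
    (assumed) lowest frequency the driver reproduces at useful volume.
    """
    if fundamental_hz <= 0:
        raise ValueError("fundamental_hz must be positive")
    return [
        n * fundamental_hz
        for n in range(2, max_harmonic + 1)
        if n * fundamental_hz >= driver_cutoff_hz
    ]


# Assumed numbers: a 50 Hz rumble and a driver that rolls off below 120 Hz.
print(audible_harmonics(50, 120))
```

A real implementation would synthesize these harmonics from the filtered low band and mix them back in at carefully tuned levels; this sketch stops at choosing the frequencies.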
By using BMR drivers, we are able to ensure consistent sound quality, without coloration, even if the speakers are slightly mispositioned on the side of the head. This is due to the unique way that BMRs radiate sound. At low frequencies they behave like traditional speakers. The electrical signal comes in, and the entire diaphragm (front part of the speaker) moves back and forth tracing the shape of the signal. However, the real magic happens at higher frequencies. When the wavelength of the bending waves travelling through the diaphragm is similar to the size of the diaphragm, traditional drivers start to go into 'break-up' modes which cause the diaphragm to bend and ripple, creating very sharp peaks and dips in the frequency response that, in addition to sounding bad, are very placement sensitive. BMRs are designed to exploit the natural behavior of the diaphragm, balancing the vibrations from different areas through optimized material selection, mass loading and extensive design simulation. In short, this ensures that your ears always receive the full sound information, even if they are not perfectly aligned with the BMR speakers.
Additionally, Tectonic was also able to mechanically minimize sound leakage. Because the Valve Index speaker drive unit is open backed, the pressure from the front side can interact with the pressure from the rear side, and these are out of phase with each other by definition. However, the drive unit itself provides a degree of “self-baffling” via its total diameter. Essentially, for any speaker drive unit, its outer diameter helps to prevent the pressure from the front side meeting the pressure from the rear side. This only helps, though, when the wavelength of the sound in air is smaller than the driver’s diameter, which acts as the self-baffle. When the wavelength becomes larger than the driver’s diameter, the pressure from the front side will directly interact with the pressure from the rear side, and strong cancellation happens. The overall diameter of the drive unit is about 5cm. This means that above about 3kHz there is no cancellation but, as we know, there is increasingly less audio content above this frequency. Most audio content exists below 3kHz, and this is where the cancellation is strong, preventing sound from bothering people nearby. “The listener wearing the headset has their ears so close to the drive unit (near-field) that the cancellation is not perceived as the pressure from the front side is RELATIVELY so much closer to the ear than the rear side.” Tim Whitwell, CTO Tectonic.
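The self-baffling argument can be explored with some rough arithmetic. The sketch below compares the wavelength of sound in air with the roughly 5cm driver diameter, then uses a deliberately crude toy model (two out-of-phase point sources with 1/r falloff) to show why a nearby ear hears far less cancellation than a distant bystander. The speed of sound, the listening distances, and the model itself are assumptions, not Tectonic's or Valve's analysis.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # dry air at about 20 C (assumed conditions)
DRIVER_DIAMETER_M = 0.05    # "about 5cm" from the text


def wavelength_m(frequency_hz):
    """Wavelength in air for a given frequency."""
    return SPEED_OF_SOUND_M_S / frequency_hz


def residual_level_db(front_path_m, rear_path_m):
    """Level left after front and rear radiation cancel, in dB relative to
    the front radiation alone.

    Toy model: two out-of-phase point sources with 1/r amplitude falloff.
    Phase differences from path length are ignored, which is only
    reasonable at low frequencies.
    """
    front = 1.0 / front_path_m
    rear = 1.0 / rear_path_m
    return 20.0 * math.log10(abs(front - rear) / front)


for freq in (100, 1_000, 3_000, 10_000):
    ratio = wavelength_m(freq) / DRIVER_DIAMETER_M
    print(f"{freq:>6} Hz: wavelength is {ratio:.1f}x the driver diameter")

# Assumed distances: ear 2 cm from the front, bystander 1 m away. The rear
# wave is assumed to travel one driver diameter further in both cases.
ear = residual_level_db(0.02, 0.02 + DRIVER_DIAMETER_M)
bystander = residual_level_db(1.0, 1.0 + DRIVER_DIAMETER_M)
print(f"ear: {ear:.1f} dB, bystander: {bystander:.1f} dB")
```

With these assumed distances the ear loses roughly 3 dB to cancellation while the bystander's level drops by roughly 26 dB, which matches the direction of the quoted explanation. The exact figures depend entirely on the assumed geometry.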
Our goal to have high quality microphones on the headset to support streamers and multiplayer experiences was easy to define. However, due to the off-ear speaker design, we expected the microphone performance to be a difficult challenge. To our surprise, this turned out to not be the case. Due to several of the ear speaker’s unique features, we were able to avoid using a significant amount of noise cancelling DSP on the microphone signal, which in turn allowed us to keep the sampling rate of the microphone stream very high at 48kHz. This is a list of features that helped build high quality microphone input:
- Dual microphone array that narrows the directional response, focusing the pickup on the signal (the user’s mouth) and rejecting extraneous external sound.
- The “self-baffling” of the BMR drivers reduces external noise pollution much more than traditional speakers.
- Speaker and microphone acoustics were designed to greatly reduce any non-linear acoustic feedback paths. The player’s own head absorbs much of the initial sound energy from the BMR speakers.
- High SNR microphones and audio paths.
- Good quality microphones and acoustic seals.
- Dynamic compression of incoming audio to avoid clipping loud voices.
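The last item, dynamic compression, can be sketched as a static gain curve: levels under a threshold pass through unchanged, and levels above it are scaled down by a ratio. The threshold, the ratio, and the function name below are assumptions, since the article does not give the headset's actual settings. A production compressor would also smooth the gain over time with attack and release stages, which this per-sample version omits.

```python
import math


def compress_sample(sample, threshold_db=-12.0, ratio=4.0):
    """Apply a static hard-knee compressor to one sample in [-1.0, 1.0].

    Levels above `threshold_db` (dB relative to full scale) are reduced so
    that every `ratio` dB of input overshoot yields 1 dB of output overshoot.
    """
    magnitude = abs(sample)
    if magnitude == 0.0:
        return 0.0
    level_db = 20.0 * math.log10(magnitude)
    if level_db <= threshold_db:
        return sample  # below threshold: pass through unchanged
    compressed_db = threshold_db + (level_db - threshold_db) / ratio
    gain = 10.0 ** ((compressed_db - level_db) / 20.0)
    return sample * gain


for value in (0.1, 0.5, 1.0):
    print(f"in={value:.2f} out={compress_sample(value):.3f}")
```

With these assumed settings a full-scale input comes out about 9 dB lower, which leaves headroom for a loud voice before the signal clips.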
All this research, iteration and feedback leads us to believe that the Valve Index ear speaker design is as close as currently possible to an optimal balance of tradeoffs and features for audio playback in room-scale VR. We’re really pleased with how the audio experience turned out. That said, there is still much more to learn and more improvements we can make.