AR

Augmented Reality solution for sound visualization

Reduced sound awareness has wide-ranging impacts on deaf or hard of hearing (DHH) individuals, from missing critical notifications such as a ringing fire alarm to inconveniences like a phone ringing at an inappropriate time. We seek to overcome these obstacles to communication with augmented reality technology. The purpose of our project is to merge computer data and the real world, resulting in a mixed-reality environment that will provide DHH users with a graphical visualization of sound data. This will be accomplished by harnessing optical hardware to combine computer-generated images displaying sound intensities, transcriptions, and sentiment state with what the user is experiencing in real life. The product will be a headset-mounted display consisting of an optical system and microphones that will overlay these graphical visualizations of sound onto the user’s view of their environment in real time. The combination of audio data graphics and the user’s perception of their real environment will constitute an augmented reality that enhances the user’s ability to interact with the world around them.


Software

The software design will consist of four major classes: the device, object detection and tracking, audio processing, and graphics generation. The software sequence diagram, shown in Figure 3, depicts the interactions between the user and each class. Essentially, the user will open the application, and the device class will access the external camera and microphones to obtain the video and audio streams. The object detection class will detect any faces in the video frames, while the audio processor class will perform audio separation and localization. After localization and separation, audio transcription and processing will occur. Face coordinates and audio information will then be sent to the graphics generator, which will perform sentiment analysis on the provided data and create the necessary graphics. Finally, the graphics will be displayed back to the user.

Figure 3: Software sequence diagram
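
To make the structure concrete, the skeleton below shows one way the four classes and the data flow from the sequence diagram might map to Python. The class and method names are placeholders rather than the project's actual API, and the detection and audio internals are stubbed out here (they are sketched in the sections that follow).

```python
class Device:
    """Opens the external camera and microphones and exposes their streams."""

    def get_video_frame(self):
        return None  # stub: would return the latest camera frame

    def get_audio_chunk(self):
        return b""   # stub: would return the latest microphone buffer


class FaceTracker:
    """Finds face coordinates in a frame (see Camera Tracking below)."""

    def detect(self, frame):
        return []    # stub: list of (x, y, w, h) face boxes


class AudioProcessor:
    """Separates, localizes, and transcribes audio (see Audio Transcription below)."""

    def process(self, chunk):
        return {"intensity": 0.0, "transcript": ""}  # stub


class GraphicsGenerator:
    """Builds overlay graphics from face coordinates and audio data."""

    def render(self, frame, faces, audio_info):
        return frame  # stub: would draw transcription/intensity graphics near faces


def run_once(device, tracker, audio, graphics):
    """One iteration of the capture -> process -> render loop from Figure 3."""
    frame = device.get_video_frame()
    faces = tracker.detect(frame)
    audio_info = audio.process(device.get_audio_chunk())
    return graphics.render(frame, faces, audio_info)  # displayed back to the user
```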

Camera Tracking

By identifying the coordinates of faces in the user’s frame of vision, we can dynamically place our graphics in free space. The tracking module will monitor the live camera feed recorded by the HMD and recognize key faces in the frame. The resulting coordinates pointing to free space will serve as anchor points for the audio visualization graphics. The first consideration in developing the tracking module was deciding between a marker-based or markerless implementation. Marker tracking trains the tracking function on specific fiducial markers, while markerless tracking relies on recognizing natural features. After consulting Dr. Atlas Wang, a UT professor who specializes in computer graphics and computer vision, we decided to pursue markerless tracking. However, we must keep in mind that current markerless tracking technologies still require a trade-off between precision and efficiency. The tracking module was built on OpenCV’s Haar cascade face detection algorithm, chosen for OpenCV’s wide-ranging applications in image processing and extensive documentation. This is a machine-learning-based approach in which a cascade of classifiers is trained on a large dataset of positive and negative images and then used to detect faces in new images. Developing the module to recognize faces in a frame involved training the algorithm against large and diverse datasets and calibrating it to receive a live image feed from a camera.
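
As a concrete illustration, a minimal version of this face-detection loop using OpenCV's bundled Haar cascade might look like the sketch below; the scaleFactor, minNeighbors, and camera-index values are common defaults rather than tuned parameters from our module.

```python
import cv2

# Load the frontal-face Haar cascade that ships with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)  # live feed; 0 = default camera

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(
        gray, scaleFactor=1.1, minNeighbors=5, minSize=(40, 40))
    # Each (x, y, w, h) box is a candidate anchor point for sound graphics.
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("face tracking", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```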


Audio Transcription

The audio processing class will be written in Python and have three main functions: localization/separation, intensity/amplitude extraction, and live transcription. These functions will process an audio stream in real time and output audio data that will be received and used by the graphics generation class. For localization/separation, we hope to use the Cone of Silence method developed at the University of Washington. Their algorithm isolates sources within an angular region,

angular region = θ ± w/2,

where θ is the angle of interest and w is the window of interest. By exponentially decreasing w and performing a binary search, the different audio sources can be localized and separated. Once the audio sources have been isolated, the audio streams will be handled with the PyAudio library, which will let us capture raw audio frames and extract features such as sound intensity and amplitude. It will also let us pass the audio stream to Google Cloud’s Speech-to-Text API, whose powerful machine learning models our design will leverage to produce accurate live transcription of speakers.
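
For the intensity/amplitude extraction step, a minimal sketch using PyAudio and NumPy might look like the following. The sample rate, chunk size, and dB scaling are illustrative choices rather than values from our design, and the separation/localization and transcription stages are only noted in comments.

```python
import math

import numpy as np
import pyaudio

RATE = 16000   # sample rate in Hz (illustrative)
CHUNK = 1024   # frames per buffer (illustrative)

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)

try:
    while True:
        # In the full design this chunk would come from a separated/localized
        # source rather than directly from the microphone.
        data = stream.read(CHUNK, exception_on_overflow=False)
        samples = np.frombuffer(data, dtype=np.int16).astype(np.float64)
        rms = math.sqrt(np.mean(samples ** 2)) if samples.size else 0.0
        level_db = 20 * math.log10(max(rms, 1.0))  # relative loudness in dB
        print(f"level: {level_db:5.1f} dB (rel)")
        # The same chunk could also be forwarded to Google Cloud's streaming
        # Speech-to-Text API for live transcription.
except KeyboardInterrupt:
    pass
finally:
    stream.stop_stream()
    stream.close()
    pa.terminate()
```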

Near-Eye Optics

One of the main engineering challenges of this project is the modeling and execution of a see-through near-eye display. This display will project the generated images and combine them with light coming from the real world. While there are many different, highly engineered near-eye display solutions, most share three common elements: a microdisplay, optical components, and a combiner. The microdisplay provides the light for the computer-generated image but generally does not face the eye directly. The light from the display is then manipulated with optical components such as lenses, mirrors, and polarizers, which transform it so that the eye can view it comfortably. Finally, a combiner merges the display light with light from the environment. To fabricate an effective display for the headset, we will need to perform extensive simulation of the optical system to ensure the final image is large, focused, and sufficiently bright. After simulation, the optical housing needs to be structurally rigid so that the optics perform consistently. The near-eye display is one of the most involved problems in this project, and its performance is crucial to the overall quality of our final product.
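
As a first-order sanity check on image size, consider a simple magnifier-style layout (an illustrative assumption, not necessarily our final optical architecture) in which a microdisplay of width w sits at the focal plane of a collimating lens of focal length f. The virtual image then appears at infinity with an angular field of view of

FOV = 2 · arctan(w / (2f)),

so, for example, a 12.7 mm (0.5 in) microdisplay behind a 25 mm focal-length lens would give a field of view of roughly 28°. These numbers are illustrative only; the actual values will come from the optical simulations.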
