Thursday, August 31, 2017

IoT Extended Sensoria

bbvaopenmind |  In George Orwell’s 1984,(39) it was the totalitarian Big Brother government who put the surveillance cameras on every television—but in the reality of 2016, it is consumer electronics companies who build cameras into the common set-top box and every mobile handheld. Indeed, cameras are becoming commodity, and as video feature extraction gets to lower power levels via dedicated hardware, and other micropower sensors determine the necessity of grabbing an image frame, cameras will become even more common as generically embedded sensors. The first commercial, fully integrated CMOS camera chips came from VVL in Edinburgh (now part of ST Microelectronics) back in the early 1990s.(40) At the time, pixel density was low (e.g., the VVL “Peach” with 312 x 287 pixels), and the main commercial application of their devices was the “BarbieCam,” a toy video camera sold by Mattel. I was an early adopter of these digital cameras myself, using them in 1994 for a multi-camera precision alignment system at the Superconducting Supercollider(41) that evolved into the hardware used to continually align the forty-meter muon system at micron-level precision for the ATLAS detector at CERN’s Large Hadron Collider. This technology was poised for rapid growth: now, integrated cameras peek at us everywhere, from laptops to cellphones, with typical resolutions of scores of megapixels and bringing computational photography increasingly to the masses. ASICs for basic image processing are commonly embedded with or integrated into cameras, giving increasing video processing capability for ever-decreasing power. The mobile phone market has been driving this effort, but increasingly static situated installations (e.g., video-driven motion/context/gesture sensors in smart homes) and augmented reality will be an important consumer application, and the requisite on-device image processing will drop in power and become more agile. We already see this happening at extreme levels, such as with the recently released Microsoft HoloLens, which features six cameras, most of which are used for rapid environment mapping, position tracking, and image registration in a lightweight, battery-powered, head-mounted, self-contained AR unit. 3D cameras are also becoming ubiquitous, breaking into the mass market via the original structured-light-based Microsoft Kinect a half-decade ago. Time-of-flight 3D cameras (pioneered in CMOS in the early 2000s by researchers at Canesta(42) have evolved to recently displace structured light approaches, and developers worldwide race to bring the power and footprint of these devices down sufficiently to integrate into common mobile devices (a very small version of such a device is already embedded in the HoloLens). As pixel timing measurements become more precise, photon-counting applications in computational photography, as pursued by my Media Lab colleague Ramesh Raskar, promise to usher in revolutionary new applications that can do things like reduce diffusion and see around corners.(43)

My research group began exploring this penetration of ubiquitous cameras over a decade ago, especially applications that ground the video information with simultaneous data from wearable sensors. Our early studies were based around a platform called the “Portals”:(44) using an embedded camera feeding a TI DaVinci DSP/ARM hybrid processor, surrounded by a core of basic sensors (motion, audio, temperature/humidity, IR proximity) and coupled with a Zigbee RF transceiver, we scattered forty-five of these devices all over the Media Lab complex, interconnected through the wired building network. One application that we built atop them was “SPINNER,”(45) which labelled video from each camera with data from any wearable sensors in the vicinity. The SPINNER framework was based on the idea of being able to query the video database with higher-level parameters, lifting sensor data up into a social/affective space,(46) then trying to effectively script a sequential query as a simple narrative involving human subjects adorned with the wearables. Video clips from large databases sporting hundreds of hours of video would then be automatically selected to best fit given timeslots in the query, producing edited videos that observers deemed coherent.(47) Naively pointing to the future of reality television, this work aims further, looking to enable people to engage sensor systems via human-relevant query and interaction.

Rather than try to extract stories from passive ambient activity, a related project from our team devised an interactive camera with a goal of extracting structured stories from people.(48) Taking the form factor of a small mobile robot, “Boxie” featured an HD camera in one of its eyes: it would rove our building and get stuck, then plea for help when people came nearby. It would then ask people successive questions and request that they fulfill various tasks (e.g., bring it to another part of the building, or show it what they do in the area where it was found), making an indexed video that can be easily edited to produce something of a documentary about the people in the robot’s abode.
In the next years, as large video surfaces cost less (potentially being roll-roll printed) and are better integrated with responsive networks, we will see the common deployment of pervasive interactive displays. Information coming to us will manifest in the most appropriate fashion (e.g., in your smart eyeglasses or on a nearby display)—the days of pulling your phone out of your pocket and running an app are severely limited. To explore this, we ran a project in my team called “Gestures Everywhere”(49) that exploited the large monitors placed all over the public areas of our building complex.(50) Already equipped with RFID to identify people wearing tagged badges, we added a sensor suite and a Kinect 3D camera to each display site. As an occupant approached a display and were identified via RFID or video recognition, information most relevant to them would appear on the display. We developed a recognition framework for the Kinect that parsed a small set of generic hand gestures (e.g., signifying “next,” “more detail,” “go-away,” etc.), allowing users to interact with their own data at a basic level without touching the screen or pulling out a mobile device. Indeed, proxemic interactions(51) around ubiquitous smart displays will be common within the next decade.

The plethora of cameras that we sprinkled throughout our building during our SPINNER project produced concerns about privacy (interestingly enough, the Kinects for Gestures Everywhere did not evoke the same response—occupants either did not see them as “cameras” or were becoming used to the idea of ubiquitous vision). Accordingly, we put an obvious power switch on each portal that enabled them to be easily switched off. This is a very artificial solution, however—in the near future, there will just be too many cameras and other invasive sensors in the environment to switch off. These devices must answer verifiable and secure protocols to dynamically and appropriately throttle streaming sensor data to answer user privacy demands. We have designed a small, wireless token that controlled our portals in order to study solutions to such concerns.(52) It broadcast a beacon to the vicinity that dynamically deactivates the transmission of proximate audio, video, and other derived features according to the user’s stated privacy preferences—this device also featured a large “panic” button that can be pushed at any time when immediate privacy is desired, blocking audio and video from emanating from nearby Portals.

Rather than block the video stream entirely, we have explored just removing the privacy-desiring person from the video image. By using information from wearable sensors, we can more easily identify the appropriate person in the image,(53) and blend them into the background. We are also looking at the opposite issue—using wearable sensors to detect environmental parameters that hint at potentially hazardous conditions for construction workers and rendering that data in different ways atop real-time video, highlighting workers in situations of particular concern.(54)