

















this will be the future of your design
Typology: Lecture notes
Uploaded on 04/07/2022


















FUTURE OF HCI
Topic:
Objectives:
Introduction

Human–computer interaction (HCI) has contributed much to the advancement of computing and its spread into everyday life. Up to the late twentieth century, the prevalent type of interface was the so-called WIMP (windows, icons, menus, pointer) graphical user interface (GUI) for the stationary desktop computing environment. This was a huge improvement over its predecessor, the keyboard-input command-oriented interface. Much innovation has been made on the two-dimensional (2-D) desktop interface since it was first introduced in the early 1980s, including ergonomic mouse and keyboard design, hypertext and web interfaces, user interface toolkits, extensions of Fitts's law, interaction modeling, and evaluation methodologies.

Looking more closely, innovation in HCI has always followed or accompanied advances in hardware and software platforms. Although the original concept of the mouse and graphical user interface was devised in the late 1960s by Doug Engelbart, it was not until the early 1980s that hardware and software technology was mature enough (and personal computers affordable enough) to accommodate the mouse and the GUI (Figure 9.1). This line of thought gives us a good glimpse into the future of HCI based on the fast-changing trends in computing platforms. Here are four major new computing platforms that have emerged in the past 10 years:
✓ Mobile and handheld platform (exemplified by smartphones), which we can carry around to compute and communicate
✓ Ubiquitous platform, in which everyday objects are embedded with interactive computing/networking devices and services
✓ Natural and immersive computing/sensing/display platform, which provides near-realistic services and experiences
✓ Cloud computing platform, which provides high-quality interactive services (based on its heavy-duty, server-level computing power) with real-time response (based on fast network service)

In the case of the cloud computing platform, the typical user does not interact directly with the system where the application resides (somewhere in the cloud), but through a client computer or device, such as an everyday desktop computer or mobile device. Despite the tremendous growth in the computing power of desktop and even mobile units, these stand-alone machines are usually not sufficient for such high-end interactive and intelligent services as image recognition, language understanding, context-based reasoning, and agent-like behavior. Note that these so-called client devices (for the cloud) are becoming increasingly rich in their sensing, display, and network capabilities.

In essence, the cloud takes the role of the Model and the client that of the View/Controller, where there can be many View/Controllers for different types of clients (e.g., desktops, pads, smartphones). This can be viewed as a way to improve the user experience (UX): the cloud provides high-quality services in real time, while specialized interaction clients focus on usability and are easily deployed (because they are lightweight and mobile). For such an envisioned future, it will be necessary to develop middleware that manages the seamless connection between the Model and any one of the many possible client View/Controllers.
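
As a rough illustration of the Model/View-Controller split described above, the following sketch keeps one shared model (standing in for the cloud application) and attaches two different clients to it. Everything here is an assumption made for illustration: the class names `CloudModel` and `ClientViewController`, the in-process callback list that plays the part of the middleware, and the two rendering styles. A real deployment would place a network between the two sides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CloudModel:
    """Hypothetical stand-in for application state that lives in the cloud."""
    state: Dict[str, str] = field(default_factory=dict)
    listeners: List[Callable[[str, str], None]] = field(default_factory=list)

    def attach(self, listener: Callable[[str, str], None]) -> None:
        # Plays the role of the middleware: remembers who must be notified.
        self.listeners.append(listener)

    def update(self, key: str, value: str) -> None:
        # Change the shared state, then push the change to every client.
        self.state[key] = value
        for notify in self.listeners:
            notify(key, value)


class ClientViewController:
    """Hypothetical client: renders model changes in its own style."""

    def __init__(self, name: str, model: CloudModel,
                 render: Callable[[str, str], str]) -> None:
        self.name = name
        self.model = model
        self.render = render
        self.screen: List[str] = []  # what this client has displayed so far
        model.attach(self.on_model_change)

    def on_model_change(self, key: str, value: str) -> None:
        self.screen.append(self.render(key, value))

    def user_input(self, key: str, value: str) -> None:
        # Controller side: forward the user's action to the shared model.
        self.model.update(key, value)


model = CloudModel()
desktop = ClientViewController("desktop", model, lambda k, v: f"[window] {k} = {v}")
phone = ClientViewController("phone", model, lambda k, v: f"{k}:{v}")
phone.user_input("volume", "7")  # one input, rendered differently on each client
```

The point of the sketch is only that the model holds no presentation logic, so a new client type can be added by supplying another rendering function.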
II. Content

1 Non-WIMP/Natural/Multimodal Interfaces

In Chapter 4, we studied the process of HCI design. After considering various requirements, user characteristics, and operating constraints, we found that the available interface choices were limited, because the possible solutions had to be consolidated within the practical limitations of the computing platforms available today (e.g., WIMP for desktops and touch for smartphones). The future, however, will bring many different computing platforms, and with them more choices, including non-WIMP interfaces that are more natural and multimodal. One of the main reasons these non-WIMP interfaces have not yet made it into the mainstream, despite the apparent need, is their lack of robustness and accuracy or, from another perspective, the relatively large amount of computation required to achieve them. The situation is changing, however, because of continued technological innovation and the emergence of cloud computing infrastructure. In light of this trend, we now review and assess the future of these HCI technologies one by one: language understanding, gesture recognition, image recognition, and multimodal interaction.
Language Understanding

The talking computer interface is undoubtedly the holy grail of HCI. Language understanding can be largely divided into two processes. The first is recognizing the individual words, and the second is making sense of the sentence, which is composed of a sequence of recognized words (usually known as natural language understanding). Word recognition (whether the words are spoken, written, or printed) is a prerequisite for sentence understanding. Here we focus only on spoken words, i.e., voice recognition.

The performance and practicality of voice recognition depend on the number of words to be recognized, the number of speakers, the level of noise in the usage environment, and the need for any special devices (e.g., a noise-canceling microphone). The current state of the art seems to be (a) a recognition rate of over 95% for individual words, (b) for at least millions of words and more than 30 languages, (c) in real time (through the high-performance cloud), (d) without speaker-specific training (by age, gender, or dialect), (e) in a moderately noisy environment (e.g., an office with ambient noise of around 30–40 dB), and (f) with the words spoken relatively close to inexpensive noise-canceling microphones or processed by noise-canceling software [2].

Such a state of the art seems quite sufficient for a more widespread presence of voice recognition in our lives, yet voice recognition remains confined to special situations such as disability support or operating constraints in which both hands are occupied. One main reason seems to be that users have little tolerance for the remaining 2%–3% of incorrect recognitions, even though humans themselves do not possess 100% word-recognition capability. Another reason might have to do with the segmentation problem.
Often, voice recognition requires a mode during which the input is given explicitly, because otherwise it is quite difficult to separate the actual voice input from the rest of the audio stream (noise, normal conversation). Entering this mode typically involves a simple additional action, such as a button push/release. However, users find this a significant nuisance. One way to overcome this problem is to rely more on multimodality. To eliminate the segmentation problem, the voice input can be accompanied by certain other modal actions, such as a gesture/posture and lip movements within a given context so
Gestures

Gestures play a very important role in human communication, in many cases unknowingly. Gestures alone can convey meaning, or they can play a supplemental role to other modes of communication. Incorporating gestures into human-computer interaction is therefore a natural objective. While there are many different types of gestures, whether from the human's perspective (e.g., supplementary pointing vs. symbolic) or from the technological viewpoint (e.g., static posture vs. moving hand gestures), perhaps the most representative is the movement of the hand(s). Hands and arms are often used for deictic gestures (e.g., pointing) in verbal communication. For the hearing-impaired, the hands are used to express sign language.

To interpret a gesture, the gesture itself, whether it is a static posture or involves movement of limb(s), must be captured over time. This is generally called motion tracking and can involve a variety of sensors targeted at many different body parts. Here we illustrate the state of the art by first looking at the problem of hand tracking. Good examples of two-dimensional (2-D) hand/finger tracking are the mouse and the touch screen. These technologies are quite mature and highly accurate, helped by the fact that the tracked target (hand/finger) is in direct contact with the device. In the case of the mouse, the user has to hold the device, and this is a source of nuisance, especially if the user is to express 2-D gestures rather than just freely control the position of the cursor. This explains why mouse-driven 2-D gestures have not been accepted by users, their application being limited so far to just a few games. On the other hand, simple 2-D gestures on the touch screen, such as swipes and flicks, are quite popular.
With the advent of ubiquitous and embedded computing, which in many cases cannot offer sufficient area for 2-D touch input, understanding aerial gestures in 3-D space will become important. Such gestures are actually closer to how humans enact gestures in real life and understand them by vision. Tracking the 3-D motion of body parts or moving objects is a challenging technological task.

The "inside-out" method requires the user to hold a sensor (e.g., 3-D mouse, Wiimote) or attach one to the target body part or object (e.g., hand, head), and both options are perceived as cumbersome and inconvenient (Figure 9.4). These sensors operate on a variety of underlying mechanisms, such as detecting phase differences in electromagnetic waves, inertial dead reckoning with gyros/acceleration sensors, and triangulation with ultrasonic waves.

The "outside-in" method requires installing the sensor in the environment, external to the user's body. Cameras and depth sensors (e.g., Microsoft® Kinect) are examples of the outside-in method. Because the user's body is free of any devices, movements and gestures become and feel more natural, comfortable, and convenient. However, with the sensors being remote, the tracking accuracy is lower than it is for the inside-out methods.

In recent years, camera-based tracking has become a very attractive solution because of innovations in computer-vision technologies and algorithms (e.g., improved accuracy and speed), the lowered cost and ubiquity of the technology (virtually all smartphones, desktops, laptops, and even smart TVs are equipped with very good cameras), ever-improving processing power (e.g., CPU, GPU, multimedia processing chips), the availability of standard and free computer-vision/object-recognition/motion-tracking libraries (OpenCV and OpenNI), and the ease of programming them (e.g., with the Processing language). There are still some restrictions. For example, the performance of camera-based tracking is susceptible to environmental lighting conditions (Figure 9.5). For highly robust tracking, markers (e.g., passive objects that are easily and robustly detectable by
With all this said, it seems that the major hurdle has been eliminated on our road to more widespread use of motion-based interaction. Yet one more problem remains, the same "segmentation" problem that was associated with voice recognition: it is difficult to segment the meaningful gestures out of the continuous motion-tracking data. Figure 9.8 illustrates the problem and its difficulty. Again, many current motion-gesture systems rely on operating in a particular mode (e.g., applying the gesture while pressing a button, or being in a particular state). However, this defeats the very purpose of bare-hand, truly outside-in sensing. Moreover, as already stated, the additional step of entering the gesture-input mode lowers usability dramatically. Innovative algorithms, such as those based on the concept of "sliding windows" (continuously monitoring a fixed or variable length of the motion stream for the existence of a meaningful gesture), may be able to solve this problem.

The segmentation problem is more challenging for gesture recognition than for voice recognition. With voice, the background noise may be low and the spoken inputs intermittent, so the voice-recognition mode can be activated automatically by sound detection (e.g., when the sound intensity exceeds some threshold). Touch gestures are similar: in most cases, touches occur only when a command is actually needed, so a touch simply signals the start of the gesture-input mode. With 3-D motion gestures, by contrast, users move continually, and only part of that motion consists of gestural commands that need to be extracted. Again, as we have indicated, multimodal interaction can partly solve this problem. Finally, in terms of usage, while motion-based interaction may be experiential and realistic, one must remember that it is easily tiring.
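
The two segmentation ideas mentioned above, activating voice input when sound intensity exceeds a threshold and scanning a motion stream with a sliding window, can be sketched in a few lines. This is a minimal toy under stated assumptions, not a production recognizer: the function names `voice_active` and `detect_swipes`, the one-dimensional hand position, the window length, and the travel threshold are all hypothetical choices made for illustration.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


def voice_active(samples: Iterable[float], threshold: float) -> bool:
    """Return True when the mean absolute amplitude exceeds the threshold."""
    values = list(samples)
    if not values:
        return False
    return sum(abs(v) for v in values) / len(values) > threshold


def detect_swipes(xs: Iterable[float], window: int = 5,
                  min_travel: float = 0.5) -> List[Tuple[int, str]]:
    """Slide a fixed-length window over hand x-positions and report swipes.

    A window counts as a swipe when its positions are strictly monotonic
    and the net travel is at least min_travel. After a hit the window is
    cleared so that one motion is not reported twice.
    """
    buf: Deque[float] = deque(maxlen=window)
    hits: List[Tuple[int, str]] = []
    for i, x in enumerate(xs):
        buf.append(x)
        if len(buf) < window:
            continue  # not enough samples in the window yet
        pts = list(buf)
        steps = [b - a for a, b in zip(pts, pts[1:])]
        travel = pts[-1] - pts[0]
        if all(s > 0 for s in steps) and travel >= min_travel:
            hits.append((i, "swipe_right"))
            buf.clear()
        elif all(s < 0 for s in steps) and -travel >= min_travel:
            hits.append((i, "swipe_left"))
            buf.clear()
    return hits


# Idle jitter, a rightward swipe, a short pause, then a leftward swipe.
stream = [0.0, 0.01, 0.0, 0.02, 0.01, 0.2, 0.4, 0.6, 0.8,
          0.81, 0.80, 0.6, 0.4, 0.2, 0.0]
swipes = detect_swipes(stream)
```

No button press or explicit mode is needed here: the idle jitter at the start of the stream never fills a window with monotonic motion, so it is ignored. Real gestures are far less clean than this, which is why the choice of window length and acceptance test is the hard part.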
Image Recognition and Understanding

Image recognition or understanding is perhaps a less used technology in HCI, especially for rapidly paced and highly frequent interaction, in which mouse/touch/voice input is more common. For instance, the most typical use of face recognition might be for initial authentication (as part of a log-in procedure). Object image recognition might be used in an information search as an alternative to the usual keyword-driven approach, e.g., when the name of the object is not known or when it is more convenient to take a photo than to type or speak the input. Rather, the underlying technology of image recognition is more meaningful as an important part of object motion tracking (e.g., face/eye recognition for gaze tracking, human body recognition for skeleton tracking, and object/marker recognition for visual augmentation and spatial registration).
Lately, image understanding has become even more important as the core technology for mixed and augmented reality (MAR), which has attracted much interest. MAR is the technology for augmenting our environment with useful information (Figure 9.11). With the spread of smartphones equipped with high-resolution cameras and GPUs (not to mention near 2-GHz processing power), and of light and fashionable see-through projection glasses, MAR has started to find its way into mainstream usage and may soon revolutionize the way we interact with everyday objects. Moreover, with the cloud infrastructure, the MAR service can become even more robust and of higher quality. Finally, image recognition can also assume a very important supplementary role in multimodal interaction. It can be used to extract affect properties (e.g., facial expression) and to disambiguate spoken words (e.g., through deictic gestures and lip movements). See Figures 9.12 and 9.13.
Although we have already outlined them in Chapter 3, we list them again here.

Composed: In this scheme, for a set of subtasks (which together satisfy a larger task), we assign the most appropriate modality to each subtask. Thus each modality takes up a different role in the interaction. The "put that there" system was one such example, where the voice was used to understand the action command (verb) and the deictic gesture to identify the target object (pronoun). By "most appropriate" we mean that a certain modality is most fitting and natural for a certain type of action. For instance, in a game application, it can be argued that various settings (e.g., selection of the character, weapon, sound options) can be accomplished with voice or touch interaction for the highest efficiency, while the game itself is played using action gestures for the experience. Note that multimodal interaction does not necessarily mean that different modal interactions occur simultaneously.

Alternative: In this scheme, as the name suggests, multiple modal interaction techniques are used for the same subtask independently. The choice is made purely by user preference or by the operational situation. When dialing in a regular situation, one might use touch interaction, while during driving, voice interaction can be used instead. This way the usability is improved by catering to the user's preferences and needs (Figure 9.15).
Redundant: In the redundant scheme, many modalities are used together (simultaneously or not) for the same task (input or output). As an interaction method, it makes the act of conveying the intent or information much more robust by combining the contributions of the individual modalities. For instance, an indication of an incoming phone call can use all three modalities: visual, aural, and tactile (vibration). With all three modalities in play, the user is less likely to miss a phone call (Figure 9.16).

Another advantage of multimodal interaction is that, to some degree, parallel interaction is possible. We often find people multitasking in different modalities, e.g., walking, listening to music, and texting. The extent of this ability is still a research question. However, it seems quite certain that for this to happen (in an effective and meaningful way), the (multi)tasks must be independent of each other. If each modal interaction
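
The composed, alternative, and redundant schemes can be contrasted in a small sketch. It is only an illustration under assumptions: the function names, the slot names in the command dictionary, and the context string "driving" are hypothetical, and recognition of the voice and gesture inputs is taken as already done.

```python
from typing import Dict, List


def composed(voice_verb: str, pointed_object: str,
             pointed_place: str) -> Dict[str, str]:
    """Composed: each modality fills a different slot of one command.

    Mirrors the "put that there" idea: voice supplies the verb, pointing
    gestures supply the object and the destination.
    """
    return {"action": voice_verb,
            "object": pointed_object,
            "destination": pointed_place}


def alternative(context: str) -> str:
    """Alternative: one of several modalities is chosen for the same subtask."""
    return "voice" if context == "driving" else "touch"


def redundant(event: str, channels: List[str]) -> List[str]:
    """Redundant: the same message is sent on every available channel."""
    return [f"{channel}: {event}" for channel in channels]


command = composed("put", "red block", "table corner")
dial_mode = alternative("driving")
alerts = redundant("incoming call", ["visual", "aural", "tactile"])
```

In the composed case, losing one modality leaves the command incomplete, whereas in the redundant case any single channel still delivers the whole message.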
As part of the cloud, and to supplement and complement the on-mobile sensors, one particular service to take note of is the sensor network service, i.e., a network of sensors in the environment collectively providing certain services mediated through the cloud. For example, sensor networks can help the mobile client infer the context of usage (e.g., location/area, lighting condition, time, number of people in the vicinity, outdoor/indoor) and provide UX at the personalized level (Figure 9.18).

3 High-End Cloud Service: Multimodal Client Interaction

Many interaction technologies require artificial intelligence (AI). After all, recognizing spoken words, sentences, images, and gestures are hallmarks of human intelligence. Advanced AI generally requires large databases, long off-line learning processes, and often heavy online computation (for real-time responses). High-performance servers coupled with mobile clients that handle fast input data capture and transfer offer an attractive solution. For example, Qualcomm® Vuforia™ [5] is a cloud-based solution for image recognition that can be used for a variety of interactive services such as augmented reality and image-based search.

To develop an interactive image-based service, the developer first registers images of the target objects to be recognized on the server ahead of time. These target images are trained off-line on the server so that they can be recognized well from different viewpoints, at different scales, and under different lighting conditions. The mobile application captures an arbitrary image and sends it to the server, which holds references to the target images of interest. The recognition computation is carried out on the server, and the results are sent back to the mobile application for further processing (e.g., augmentation on the screen), all in real time (Figure 9.19).

of its sensing and display capabilities. Interactions of the applications on the server can be described and coded only in abstract terms and communicated to the client for actual realization based on the known capabilities of the client device. This way, different models and types of devices can use the same cloud applications and services, with interaction customized for users and their particular devices.
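
The register, train, capture, recognize, and respond steps of the image-based service can be mimicked with a toy sketch. It does not use or imitate the Vuforia API. The `RecognitionServer` class, the `mobile_client` function, the 3×3 grayscale "images", and the brightness-based signature are all hypothetical stand-ins, and the "server" runs in the same process instead of across a network.

```python
from typing import Dict, List, Optional, Tuple

Image = List[List[int]]  # tiny grayscale image, pixel values 0..255


def signature(image: Image) -> Tuple[int, ...]:
    """Toy feature: 1 where a pixel is brighter than the image mean.

    Comparing against the mean makes the feature insensitive to a uniform
    brightness change, a crude stand-in for lighting robustness.
    """
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)


class RecognitionServer:
    """Hypothetical server holding precomputed target signatures."""

    def __init__(self, max_distance: int = 1) -> None:
        self.targets: Dict[str, Tuple[int, ...]] = {}
        self.max_distance = max_distance

    def register(self, name: str, image: Image) -> None:
        # "Off-line training" is reduced to precomputing the signature.
        self.targets[name] = signature(image)

    def recognize(self, image: Image) -> Optional[str]:
        # Return the closest registered target, or None if nothing is close.
        query = signature(image)
        best_name, best_dist = None, None
        for name, ref in self.targets.items():
            if len(ref) != len(query):
                continue
            dist = sum(a != b for a, b in zip(ref, query))
            if best_dist is None or dist < best_dist:
                best_name, best_dist = name, dist
        if best_dist is not None and best_dist <= self.max_distance:
            return best_name
        return None


def mobile_client(server: RecognitionServer, captured: Image) -> str:
    """Hypothetical client: send the capture, then augment with the result."""
    name = server.recognize(captured)
    return f"overlay label: {name}" if name else "no target found"


server = RecognitionServer()
server.register("poster", [[200, 200, 10], [200, 10, 10], [10, 10, 10]])
server.register("logo", [[10, 10, 200], [10, 200, 200], [200, 200, 200]])

# The same poster photographed under brighter lighting (+40 on every pixel).
bright_poster = [[240, 240, 50], [240, 50, 50], [50, 50, 50]]
result = mobile_client(server, bright_poster)
```

The client stays thin: it captures and displays, while all matching happens on the server side, which is the division of labor the text describes.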