Last modified: Wed Feb 3 1999

How to meet the user requirements of room-based videoconferencing

By Tobias Öbrink

Room to room videoconferencing is nothing new. It has been around for more than ten years. In the late 1980s, circuit switched based systems evolved from dedicated studios, via roll-about systems to desktop-based codecs, while in IP-based systems the first applications were desktop-based software videophones that soon evolved into full desktop videoconferencing products. For some reason IP-based room-to-room systems never took that last step to commercial products. There are quite a few theories in the literature trying to explain why. The most commonly recited reason is that packet-based systems doesn't offer sufficient transmission quality. In this paper I propose work to examine some of the technological limitations of an existing, commonly used IP-based desktop videoconferencing application and find enhancements resulting in a packet-based room-to-room videoconferencing system offering sufficient subjective end-to-end quality. The resulting room-to-room videoconferencing system will be used to find the network requirements, i.e. the transmission quality mentioned above, needed by the system.

1 Introduction

1.1 Background

1.2 The organization of this document

2 Related work

2.1 Human factors

2.1.1 Sight

2.1.2 Hearing and speech

2.1.3 Mental factors

2.1.4 Body language

2.2 Video-mediated human communication

2.2.1 Is video really needed?

2.2.2 Mediation-related problems

2.2.3 Display size

2.2.4 Video-mediated body language

2.2.5 The relation between video and audio

2.2.6 Audio-video applications

2.2.7 Analog setup

2.3 Desktop - and room-based videoconferencing

2.3.1 Limitations of desktop - and room-to-room videoconferencing

2.4 VIC, a high-end desktop video tool

2.5 Multipoint-to-multipoint techniques

2.5.1 Point-to-point mesh

2.5.2 Reflector node

2.5.3 IP multicast

2.6 Measurement methods

2.6.1 Subjective methods

2.6.2 Perceptual models

2.6.3 Comparative methods

2.7 Important parameters

2.7.1 Network performance

2.7.2 Computation performance

2.7.3 Media-specific quality parameters

3 Ideas and preliminary results

3.1 What do I mean with high-quality videoconferencing?

3.2 The prototype

3.2.1 Differences between the room-to-room - and desktop videoconferencing situation

3.2.2 Audio-video hardware

3.2.3 Host limitations for videoconferencing

3.3 Evaluation

3.3.1 Limitations of the model

3.4 Measurements

3.4.1 Method

3.4.2 Audio measurements

3.4.3 Video measurements

3.4.4 Limitations of the measurements

3.5 Summary

3.6 Future work

4 Action / Time plan / Milestones

4.1 Literature study

4.1.1 Literature study on measurement methods and parameters

4.1.2 Host limitations and add-on equipment

4.2 System implementation

4.2.1 Performance measurements of the current application

4.2.2 Suggest enhancements and develop prototype

4.3 Evaluation

4.4 Conclusion

4.5 Writing

5 Rough outline of the thesis

6 Terminology and Concepts

7 References

Appendix A. Multipoint techniques

Appendix B. Host limitations

Appendix C. System parameters

Appendix D. Levels of flow separation

Appendix E. Local transmission echo

Appendix F. Implementation plan

Appendix G. Evaluation methodology

1 Introduction

Right now the transmission capacity of optical fibers is increasing faster (proportionally) than the computing power of processors. Therefore it is likely that future tele/computer-systems will be optimized to save processing power in network nodes and hosts rather than link transmission capacity. The bandwidth available to the average user will increase faster than the commonly affordable computing power. Following this reasoning the Telecommunication Systems Laboratory at the department of Teleinformatics, KTH is researching network support and protocols for very high speed networks to take advantage of the expected future performance increase of fiberoptic equipment. Another important research area is the backbone support for the extension of high bandwidth connections to mobile clients to meet the user's demand for mobility.

One of the visions of the future is that of a global network containing very powerful (and expensive) servers serving a multitude of more or less intelligent client terminals over high bandwidth fiberoptic and wireless connections. Another vision is of a multicast-supporting network with distributed applications migrating through the network to serve the user at the current location(s) of the user. These two visions are not necessarily contradictory.

Our research has so far been mainly concentrated on the network level of the architecture to efficiently use the potential of fiberoptics, better utilize available radio frequency bands, and to support quality of service guarantees and group communication. Now, we are starting to research higher level support for the services in the architecture, such as network management, transport protocols, and security.

However, to research support for a set of future services would be quite meaningless without a good picture of the demands of these services. Therefore a group at TS-Lab is researching future content offered to the future user, and the lower level support demands this content will put on the underlying infrastructure.

1.1 Background

So what are the future services? And more importantly, what are the demands of the future services on the network infrastructure? I believe future services will be all about communication and cooperation between people, so therefore in my research plan I declared that I will research multicast-based, distributed applications taking advantage of high performance, heterogenous networks that may consist of both physical links and wireless links.

The most common prediction that one can find in literature debating future services is that it will be a lot of video and virtual reality. There are a lot of video-based services as can be seen in the related works section and their network requirements are quite different. The same applies for virtual reality although I haven't bothered to include a corresponding listing for that service.

In this paper I propose work to investigate in detail the network requirements of one type of video-based service for communication between people, namely videoconferencing. Videoconferencing is nothing new. It has been around for more than ten years [4, 13, 14], starting with the building of dedicated studios connected by circuit-switched networks. Next step in the evolution was a packetized movable system called roll-about, and the final development was add-on products for desktop computers. In the case of IP-based systems the first applications were crude software videophone toys developed on desktop computers, that soon evolved into full desktop videoconferencing products [14].

For some reason the IP-based room-to-room systems never took that last step to commercial products. There are quite a few theories in the literature trying to explain why. The most commonly recited reason is that packet-based systems doesn't offer sufficient transmission quality. Since the quality seems to be sufficient for desktop videoconferencing, this statement led me to think that room-to-room videoconferencing puts higher demands on the network infrastructure than the desktop dito and thus should pose a greater challenge.

1.2 The organization of this document

In the next chapter you will find a summary of work related to this thesis proposal. It is far from complete but should give a fair understanding of the issues discussed in chapter 3, where I present my ideas and pointers to some preliminary results. Following is my time plan in chapter 4 and a rough outline of the thesis in chapter 5. In chapter 6 is a list of terms used in this paper and their definitions, just to avoid misunderstandings. Next is a list of references used in this paper, followed by a sequence of appendices describing in more detail the preliminary results used in chapter 3.

2 Related work

Most of the work introduced here belongs to the field of Human-Computer Interaction (HCI) and Computer-Supported Cooperative Work (CSCW) and some belongs to telecommunications.

2.1 Human factors

In my work I assume that humans are the final end-points in the communication and therefore it is natural to take a normal human's abilities as the base from where to study the videoconferencing system. Much data on this can be found in literature on HCI and CSCW that in turn references to works in the fields of psychophysics, ergonomics, information science and technology, systems design and cognitive psychology as the source of information about the human abilities and limitations. The senses used in a videoconferencing session is mainly sight and hearing, therefore a lot of the information in the literature covers the limitations of these senses. More subtle factors are mental factors, such as association, and all the social rules that people apply when communicating. Body language, tone of voice and eye contact are all crucial in enabling smooth conversation.

2.1.1 Sight

In [14] F. Fluckiger gives some information on how the human eye and visual processing works. Only radiation of certain wavelengths is visible - those lying in the visible range: from about 250 nm to 780 nm. The human eye also can discriminate between different wavelengths and each wavelength creates a different impression referred to as the color sensation. The eye is more sensitive to certain wavelengths than to others, implying the human eye is more sensitive to certain colors than to others. For example yellow or yellow-green seems brighter than red or violet.

To determine the distance to the point of focus, we use binocular convergence, that is the angle between the line of sight of each eye. We also use binocular parallax - the differences between the two images due to the space between the eyes, for the same purpose. By combining the two pieces of information, the brain creates a sensation of depth.

Motion is also basic for visual perception. We move towards and from things, we look around them and move them to examine them closer. Movement also contributes with information about the three dimensional layout in form of motion parallaxes.

2.1.2 Hearing and speech

In [14] F. Fluckiger says that the range of frequencies that humans generate when speaking is normally 50 Hz to 10 kHz, but the range of frequencies that humans can hear is normally 15 Hz to about 20 kHz (depending on age). The consequence is that the bandwidth necessary to communicate speech is smaller than that needed for other sound, e.g. music. During a speech, sequences of phonemes follow silent periods. Typically 60% of the speech consists of silent periods.

The human is also able to determine the location of the source of a sound in a three-dimensional space. The first clue is based on the intensity difference of the two stimuli presented to our ears. Likewise, the waveform will reach each of our ears at two distinct instants in time. The conjunction of the intensity and time difference produces an impression of lateralization. The ear also filters certain frequencies more than others, which helps detecting if the source is front, back, up or down from us. Sounds reverberate on surrounding objects and change when we move our head, which also helps in determining the position of the source. Three-dimensional sound helps to give situation awareness and to discriminate between participants in a conference.

In [?] W. Stallings says that frequency components of speech may be found between 20Hz and 20kHz and that frequencies up to 600 to 700 Hz add very little to the intelligibility or emotional content of speech.

2.1.3 Mental factors

In [14] F. Fluckiger notes that the user's tolerance will generally be derived from their experience of comparable applications. For example, when viewing a movie-on-demand, a subscriber in the USA will compare the quality to that of NTSC over-the-air cable programs.

In [11] A. J. Dix, G. D. Abowd, R. Beale and J. E. Finley also discuss transfer effects; when we come to use computer-mediated forms of communication, we carry forward all our expectations and social norms from face-to-face communication. The rules of face-to-face conversation are not conscious, so when they are broken we do not always recognize the true problem. Therefore, success with a new media is often dependent on whether the participants can use their existing norms.

They also also discuss something called personal space. When we converse with one another we tend to stand with our heads a fairly constant distance apart. We can accept people closer to us if they are at our sides or behind us than if we are facing them. These distances form a space called the personal space. The exact distance depends somewhat on context. A high level of noise may make people come closer just to be heard. Also if the conversants want to talk privately they tend to come closer. Personal space also differs across cultures, North Americans get closer than Britons, and southern Europeans and Arabs closer still. This can cause considerable problems during cross-cultural meetings.

2.1.4 Body language

In [32] C. Katzeff and K Skantz states that compared to telephony, video offer a lot of information that participants use to enhance their interaction. Humans usually has broad experience of interpreting small nuances in facial expressions, gestures and posture and adapts their dialogue in reponse to these interpretations. If the speaker see that the audience looks bewildered, then he can explain more in detail or ask the audience what causes the confusion. Socialpsychological studies has shown that ca 30% of a face-to-face conversation consists of mutual glances. Those glances are considered to have at least five functions; to guide the flow of conversation, to give feedback from the listener, to communicate feelings, to communicate the character of the relationship between the conversing people, and to mirror the status of the relationship between the conversing people.

In [11] A. J. Dix, G. D. Abowd, R. Beale and J. E. Finley states that our eyes tell us whether our colleague is listening or not, they can convey interest, confusion or boredom. This involves not just the eyes, but the whole facial expression. Sporadic direct eye contact is important in establishing a sense of engagement and social presence. Eye gaze is useful in establishing the focus of the conversation. If you say `now where does this screw go?', there may be many screws, but if your colleague can see which one you are looking at then he/she is able to interpret which one you mean. In a similar but more direct way, we use our hands to indicate items of interest. This may be conscious and deliberate as we point to the item, or may be a slight wave of the hand or alignment of the body.

2.2 Video-mediated human communication

There have been many studies comparing video-mediated human communication with both audio-only and face-to-face communications. Here I present some of the findings. First is a general discussion on the use of video, followed by a few examples showing some problems one will run into when using the video - and audio medium as a substitute for face-to-face interaction. Next is a short note on the relation between audio and video in human communication followed by a list over some typical audio-video applications and a collection of different analog setups.

2.2.1 Is video really needed?

In [32] C. Katzeff and K Skantz presents the results of a literature survey to find the state of the art of the role of video quality in human communication. Some of their findings that I think relate to my work is presented here.

They found that there is so far no mapping between which aspects of video quality that influence which aspects of human communication. Further, they found that participants who don't know each other in advance have not been proven to solve problems more efficiently if one add video to an audio connection than if only using the audio.

One theory suggested by K. E. Finn, A. J. Sellen and S. B. Wilbur in [4] is that the effects of video are too subtle and long-term to be measurable in simple problem-solving tests, which is a frequently used test type in HCI.

As noted in 1.1 above, video offer an additional feedback channel that participants use to enhance their interaction. Visual feedback is also the reason why C. Katzeff and K Skantz believes video interaction is significantly richer, more subtle and simpler than interaction over telephone only. A conclusion of this belief is that video is of most use in situations where the richness of the human interaction is most needed, e.g. conflict resolution, negotiations and establishing relations. They also state that this feedback, obviously, is not possible in a voice activated video scheme where the video feed shows only the current speaker.

2.2.2 Mediation-related problems

Mediation-related problems are mainly caused by partly connecting two normally detached environments. Many of these problems can be overcome by mediating even more information than only audio and video.

In [11] A. J. Dix, G. D. Abowd, R. Beale and J. E. Finley notes that a problem with personal space can occur in a videoconference. If each participant adjust the zoom of the camera in his/her site according to his/her cultural norm we risk ending up in the same difficulties as in face-to-face meetings, unnecessarily. To avoid this in a two-site session the participating groups should be able to adjust the zoom of the camera at the other site to achieve acceptable personal space.

A similar situation can occur with audio; participants that want to talk privately tend lean close to the screen even though this may have no effect whatsoever on the sound level on the remote side. I guess that this can be avoided by mediating three dimensional sound. It is also quite easy to move out of the range of the camera, whilst still being able to see your colleague.

Another problem identified by F. Fluckiger in [14] is the "flat" representation of video. Binocular parallax can be simulated using stereoscopic display techniques, but simulating binocular convergence is trickier. Moving around to get a better look at something on the other side of the screen is also quite useless.

2.2.3 Display size

In [11] A. J. Dix, G. D. Abowd, R. Beale and J. E. Finley states that even poor quality video or very small monitors can convey facial expressions.

But in [32] C. Katzeff and K Skantz presents results showing that a small video display (40 mm high, 65 mm wide) of the same resolution as a larger video display (103 mm high, 140 mm wide) showing the face of an other participant results in a more telephone-like conversation than the larger video display. The conclusion is that if the goal of the videoconference is to offer an illusion of presence with fluid conversations and a lot of interactivity, then a large video display is preferable. If the goal is to get a clear and distinct conversation, then a video link is not necessary at all. They also presents results showing that the size and position of a video window as compared to other video windows on a shared display convey information on the participants' status and personal relations.

2.2.4 Video-mediated body language

In [11] A. J. Dix, G. D. Abowd, R. Beale and J. E. Finley says that video connections are unlikely to show enough of your office for your colleague to be able to follow your gaze or hand or body alignment. This can be a serious problem since our conversation is full of expressions such as "let's move this one there", where the "this" and "there" are indicated by gestures or eye gaze. A focus that just catches the corner of the monitor and desk can help.

Furthermore, eye contact is hindered by the placement of the camera apart from the screen. If you put a camera on top of the monitor, your partner will always see you looking slightly downwards, and vice versa.

Even when the participants are in the same room, the existence of electronic equipment can interfere with the body language used in normal face-to-face communication. The fact that attention is focused on keyboard and screen can reduce the opportunities for eye contact.

2.2.5 The relation between video and audio

The ear and the eye works very differently. In [14] F. Fluckiger says that the ear may be modelled as a differentiator, it is very easy to hear sounds from different sources even if they are intermixed. The eye works as an integrator, it is extremely difficult to recognize separate images if they are mixed, and it's difficult to see changes less than a few seconds long made in a familiar video clip. The consequence is that humans are much more sensitive to alterations of audio - than of visual signals and thus less tolerant of audio - than of video errors. When the two streams compete for the same network - and end-system resources, the audio stream should have the higher priority if possible.

According to the literature survey conducted by C. Katzeff and K Skantz in [32], especially in the case of face-to-face communication, synchronisation between audio and video is important to avoid misunderstandings about turn-taking and intent. They also state that delayed video in relation to audio can be strenuous to look at for a longer time and don't give a full sensation of presence. For these reasons trials with delayed audio to achieve lip synchronisation have been done. However, small audio delays have shown to seriously deteriorate participants ability to come to conclusion and also seriously lessen the participants' satisfaction with the conversation. The additional audio delay seem to cancel the benefit from the lip synchronization by far.

2.2.6 Audio-video applications

There are a lot of different audio- and video-based computerized communication tools and in many papers the term "videoconference" can stand for almost anything mediating audio and video. To avoid confusion and to show the relation between different groups of audio- and video-based computerized communication services I present a list of definitions that I've stumbled across in the literature. In [14] F. Fluckiger introduces a range of more or less computerized applications of audio-video communication.

Computer-assisted circuit telephony. A computer is used for part or all of the functionality of a terminal connected to a circuit telephony system. It can also provide a multitude of complementary services, e.g. directory functions, message recording, call distribution etc..

Packet voice conversation. Same as computer-assisted telephony, but the underlying network is packet switched.

Videophony. Telephony with motion video. In practice 6 to 12 fps and a frame size large enough to show a talking head view at low to medium resolution, sometimes with the forwarded video in a smaller window superimposed on the received video. Videophones may be video-extended telephone sets, so called video dialtones, or a computer equipped with necessary hardware and software.

Video seminar distribution. Generally includes a speaker and his/her notes. The speaker is a moving target and this demands either a camera operator or a wide camera shot. As for videophony, the movement of the speaker is more important than the resolution. The notes, consisting of a chalkboard or a slide show, is generally fixed and includes important details. This demands a focused camera shot and resolution is more important than movement. An optimal technique would be for the video operator to switch between two trade-off modes. Receivers tend to be more tolerant of sound distortions or gaps when passively listening, than during a conversation.

Audio-video conferencing. Usually abbreviated to videoconferencing. The objective is to support a meeting between more than two remote participants. If biparty, the conference connects groups of people; if multiparty, it may connect a mixture of groups and individuals. Participants may gather either in an office using desktop or rollabout systems, or in a meeting room, but in both cases the system have to cope with several heads and shoulders, i.e. a group view. A frame rate of 8 to 12 fps is acceptable though of course entailing a jerky effect. The resolution may be of medium quality. The sound quality requirement is even more stringent than for videophony. Documents need to be exchanged, either on paper or in projected or in electronic form. If the meeting is symmetrical - that is, with balanced contributions from all participating sites - six to eight systems is a practical limit beyond which floor passing becomes difficult to manage. A figure of 12 sites is cited as the limit beyond which anarchy becomes inevitable, unless some participants are very passive.

In [32] C. Katzeff and K. Skantz presents yet another audio-video application:

Media space. A media space is a computer controlled network of audio and video equipment for collaboration between groups of people distributed in time and space. A media space is continually available and is therefore not a service that is only available at certain predetermined times.

In [11] A. J. Dix, G. D. Abowd, R. Beale and J. E. Finley presents the

Video-wall. Also called a video-window. A very large television screen is set into the wall of common rooms at different sites. The idea is that as people wander about the common room at one site, they can see and talk to people at the other site, giving a sense of social presence.

2.2.7 Analog setup

In [14] F. Fluckiger presents different types of analog setups for the different applications presented in the preceding section.

Talking head view. Designed for face-to-face conversation, and optimally used by two persons only. A single fixed camera is positioned close to the display and focused to capture one head and two shoulders. The main purpose of the video is to show emotional information and provide eye-to-eye contact. No details need to be accurately displayed, but movement is crucial .

Side view. Experience has shown that, when used for cooperative work in conjunction to other teleconferencing tools, the optimal positioning of the camera is a side position which provides more information on the context. This demands higher resolution than the talking head view.

Group view. If using one fixed camera taking in a view of the whole group it may be difficult to identify the speaker if the group exceeds five people. The camera should ideally be placed at the end of a long table. If using a moving camera it can take a view of the current speaker, may use preset positions, may be controlled by local chair, local operator, from the remote side or by voice activation. If using two cameras one can take an occasional overall view while the other zoom in on the current speaker. In addition, the size of the display monitor must be suitable for a group of viewers. Large systems use 27 inch monitors. Since the groups need to communicate among themselves it often is not practical for the participating individuals in a given room to wear headphones. Therefore loudspeakers with a properly implemented echo avoidance have to be used.

Electronic document exchange. Document handling may imply one or several of the following facilities: a dedicated document camera generally requiring higher quality than the regular speaker camera, a good resolution camera to capture an overhead screen or chalkboard, fast scanners, a data channel to transmit computer-generated spreadsheets or other digital documents.

2.3 Desktop - and room-based videoconferencing

As noted above, there are two main groups of videoconferencing - room based videoconferencing and desktop videoconferencing. To understand the differences between the two, one can take a look at the history. In [14] F. Fluckiger gives an account of the history of videoconferencing.

The circuit-switched-based videoconferencing systems appeared in the 1980s. The first services were provided by public telephone operators (PTOs) in dedicated meeting rooms, equipped with analog audio-visual devices, digitizers and compressor/decompressor systems as well as a connection to the PTO's internal circuit-switched network. In the second half of the 1980s dedicated videoconference products appeared on the market in the form of packaged systems with TV cameras, microphones, speakers, monitors and modules for digitization, compression and decompression. They were installed in private videoconference studios connected by leased lines or aggregated telephone connections. These packaged, stand-alone videoconference systems are called video-codecs and generally are connected directly to circuit-switched networks at speeds from 112 or 128 Kbps to 336 or 384 Kbps. Most offer an optional document video camera and some support two or more room cameras. The next development was the introduction of rollabout systems - a circuit videoconference system that can be moved between meeting rooms. Lighter and cheaper than static video-codecs and operating at speeds from 112 or 128 Kbps to 336 or 384 Kbps. Rollabout systems generally offer fewer optional features than static video-codecs. The latest generation of circuit-switched videoconference systems addresses the desktop mode where the service is provided in offices. These systems offer even fewer features.

Packet-based videoconference systems, on the other hand, has evolved from interpersonal systems, i.e. videophony, to multiparty desktop systems, and can finally be used in room or mixed room/desktop environments. The first generation of products is, however, not provided with dedicated facilities such as camera prepositioning, document camera, or sophisticated audio handling. Most applications use the IP or IPX protocol. They may offer high resolution at low framerate, but audio equipment is either fair or medium quality and echo cancellation is seldom treated properly.

Packet videoconferencing also support room mode, especially for asymmetric conferences, but the first generation of products is not provided with dedicated facilities such as camera prepositioning, document camera, sophisticated audio handling or floor control.

In [32] C. Katzeff and K. Skantz says that advances in computer technology like faster processors and better systems for compression have made it possible to integrate video data in the day-to-day computer environment. In this way "desktop videoconferencing" have arised. By adding soft- and hardware to one's ordinary desktop computer it's now possible to videoconference from the desktop.

The main difference thus is the settings for which the applications are optimized for. In [14] F. Fluckiger defines a "Desktop mode" and a "Dedicated rooms mode" Desktop mode is defined as when the service is delivered on the end-user's desktop. The tendency is to use the regular desktop computer system of the user instead of installing additional devices such as a TV monitor. When the end-user is not familiar with computers, bringing multimedia services to the desktop often results in installing dedicated stand-alone devices such as integrated videophones. In the dedicated rooms mode, also called studio mode, an organization has specially equipped dedicated multimedia rooms for videoconferencing.

In [4] K. E. Finn, A. J. Sellen and S. B. Wilbur presents a collection of articles on video-mediated communication. One important note is that desktop videoconferencing applications generally take advantage of having a window system to provide additional information, for example awareness, and ability to use other applications at the same time on the same machine. In a room-to-room conference you are not limited to run all these features on the same screen or even the same machine. They also show several examples where people have taken advantage of the spatial properties of the room to provide more intuitive awareness and support for social cues than can be delivered on the limited screen real-estate of a typical desktop computer.

Although F. Fluckiger states that packet videoconferencing also support room mode, as of November 1998 I haven't been able to find one specialized IP-based product comparable to the H.320-based videoconferencing studios.

2.3.1 Limitations of desktop - and room-to-room videoconferencing

In [14] F. Fluckiger states one fundamental limitation of desktop videoconferencing, the one about technical requirements. The minimum is to install an audio-video client software module. According to mr Fluckiger, a software module is only sufficient for low frame rate and low-resolution quality. The CPU load of the receiving desktop computer can be alleviated by the use of an outboard decompression card. Then the CPU no longer acts as the limiting factor to the displayed frame rate. Certain video compression/decompression cards replace the regular display memory by a dedicated high-resolution (24 bits) frame buffer.

In [32] C. Katzeff and K. Skantz states that room based videoconference demands specially equiped rooms with expensive hardware. On the other hand, this is justified by a pooling of resources.

2.4 VIC, a high-end desktop video tool

VIC is one of the most successful desktop video tools for packet-based networks. Unlike many of it's predecessors, VIC is highly optimized for a typical Internet environment with lossy connections and low-end desktop computers. The source code is freely available and modular as described in [29]. This makes VIC a good platform for prototyping. The latest version is available from the MASH research group at the University of California, Berkeley [33].

In [1] S. McCanne and V. Jacobson presents the UCB/LBNL VIdeo Conferencing tool (VIC). VIC was designed with a flexible and extensible architecture to support heterogeneous environments and configurations. For example, in high bandwidth settings, multi-megabit full-motion JPEG streams can be sourced using hardware assisted compression, while in low bandwidth environments like the Internet, aggressive low bit-rate coding can be carried out in software.

VIC provides the video portion of a suite of applications for multimedia conferences developed at UCB; audio (VAT), whiteboard (WB), and session control (SD) tools are implemented as separate applications. Those tools can interoperate through a Conference Bus to support features such as voice-switched windows, synchronized media and having a remote moderator to enforce floor control. A serious deficiency is that the tools have completely independent user interfaces. For example, VAT, VIC, and WB all employ their own user interface elements to display the members of a session, which leads to multiple (possibly inconsistent) lists. S. McCanne and V. Jacobson suggests that a better model would be to have a single instance of this list across all the tools for multimedia conferences.

In VIC, awareness of other participants is supported through a list of thumbnail views of each active participant. There is also a separate members-list listing both active and passive participants. A thumbnail image is not updated in real-time, but rather every few seconds, but it can be enlarged in a separate window into one of several different supported formats to see the actual received video stream. VIC uses RTPv2 [10] for video transport and for gathering awareness information and network statistics. To Provide confidentiality to a session, VIC implements end-to-end encryption using the Data Encryption Standard (DES).

2.5 Multipoint-to-multipoint techniques

One key characteristic of videoconferencing is the need to support communication between groups of people. This may imply only one connection between two groups of people or multiple connections between single people or any combination of those extremes. Multipoint-to-multipoint videoconferencing naturally implies connections between more than two sites, so what alternatives are available for multipoint connections over IP?

2.5.1 Point-to-point mesh

The earliest and still most robust way to provide multipoint conferencing is to set up a mesh of connections between the participants. Every end-system is directly connected to all the others, requiring n2 - n connections to fully interconnect n systems. IP-based point-to-point meshes are implemented in the Application Layer of the OSI model. ISDN or similar circuit switched networks requires as many biparty connections (circuits) as remote sites for unidirectional transmission. A practical observed limit is one source and seven receivers [14]. One advantage with point-to-point connections is that confidentiality is higher as the conference only involves the members specifically accepted and with which dedicated connections have been set up. Early versions of ISABEL [19] used point-to-point as well as Communique! [37].

2.5.2 Reflector node

In circuit switched networks, a video branch exchange (VBX), or videoconferencing hub (video-hub for short) is used to provide a star point which acts as a switch between all participating video-codecs. Video hubs can be switching digital signals directly, e.g. a multipoint control unit or multiparty conferencing unit (MCU), or the signals can be converted to analog signals, called video-switches or video mixers. Video switches generally introduces an additional one second delay due to the D/A and A/D conversion. Advanced MCUs are capable of translating between different audio and video encoding and compression schemes. They are called transcoder MCUs. Usual video hubs allow for eight participating systems in multiparty conferences, some systems support up to 24 calls. Video hubs may be chained to form a cascade, thus extending the topology from a star to a tree scheme. Many video-hubs use voice-activation to only forward video from the party generating sound at a given moment [14]. The most commonly used videoconferencing applications today use point to point setups with H.32x (e.g. Microsoft's Netmeeting and PictureTel's products). H.32x supports multipoint conferencing through the use of a MCU. The IP-based reflector approach is implemented in the Application Layer of the OSI model. Cu-SeeMe [38] is an example of a non-H.32x conferencing product using reflectors.

2.5.3 IP multicast

A solution that is only available to IP-based systems is to use IP multicast support in the infrastructure to provide multipoint conferencing. IP multicast is implemented in the Network Layer of the OSI model. Multicasting is the capability of the network to replicate, at certain internal points, the data emitted by a source. Replicated data should only be forwarded to the recipient end-systems which are part of the multicast group so as to avoid or minimize segments of the networks to be traversed by multiple copies of the same data [12, 14, 17, 18, 36]. Depending on the network topology, the use of IP multicast instead of a MCU or a mesh helps avoiding unnecessary waste of bandwidth (see Appendix A). IP multicast also scales better to large numbers of participants in a conference, with respect to network- and end-host resources, than the reflector- and point-to-point mesh solutions [25]. The MBone tools [39] uses IP multicast.

2.6 Measurement methods

Here I will give an account of a few methods for measuring the quality offered by an audio-video application. Since the end-points of the communication are humans, sooner or later all measurement results must be mapped to subjective ratings of perceived quality. These measurements tend to be both expensive and lengthy, however, so there have been efforts to develop mesurement tools that mimic human senses and map indata to specific subjective rating scores, so called perceptual models. Even simpler are the comparative method where one measures one parameter while varying another. The comparative method puts high requirements on the choice of parameters to measure and vary as well as on the fullness of documentation of those parameters.

2.6.1 Subjective methods

In [21] two masters degree students, T. Poles and K. Elezaj, did some measurements on the sound quality degradation related to the number of Internet telephone calls over a 10 Base TX Ethernet. The measurement methods they considered were the Mean opinion score (MOS) and the Degradation mean opinion score (DMOS). MOS is a method for obtaining subjective ratings of perceived quality on a five-grade scale ranging from 1, Bad, to 5, Excellent. The MOS value is extracted from the results of an Absolute Category Rated (ACR) test performed on 20 to 60 untrained persons. The DMOS is a method for obtaining subjective measures including ratings of perceived quality degradation compared to an original. DMOS uses a five-grade scale ranging from 1, Very annoying, to 5, Inaudible. The DMOS value is extracted from the results of an Degradation Category Rated (DCR) test performed on 20 to 60 untrained persons.

2.6.2 Perceptual models

In [26] F. Bock, H. Walter and M. Wilde presents an image distortion measure adapted to human perception (DMHP) and compares it to the most common quantitive measure, the mean square error (MSE) for a couple of well-known test images. The DMHP system separates the original image into three characteristic classes, namely edges, textures and flat regions. The errors of each class can be globally weighted according to human perception and then the MSE on those values gives a DMHP value. In the comparison the local variance is used for assessment of errors in the textures and the flat classes while in the edge class the assessment is weighted depending on if the error is a lost edge, a changed edge or a new edge according to a rule defining what is perceivable errors. All the errors are also weighted based on the background illumination level.

In [27] C. J. van den Branden Lambrecht presents a general architecture for the end-to-end testing of digital video transmission systems based on a model of spatio-temporal vision where most major aspects of human vision are addressed, namely the multi-resolution structure of vision, sensitivity to contrast, visual masking, spatio-temporal interactions and color perception. The model is parameterized and several video quality metrics are introduced and tested on compressed video material and compared to subjective tests and tests with a quantitative video quality metric developed by the Institute of Telecommunication Science (ITS) in Colorado.

In [34] A. B. watson presents a Digital Video Quality (DVQ) metric that is reasonably accurate, but computationally efficient. Because most video coding standards used today use the Discrete Cosine Transform (DCT), the DVQ metric computes the visibility of artifacts expressed in the DCT domain using a model of human spatial, temporal and chromatic visual processing. The metric incorporates human spatial, temporal and chromatic contrast sensitivity, light adaption and contrast masking.

2.6.3 Comparative methods

In [7] P. Bagl, P. S. Gauthier and R. Ulichney presents issues related to the implementation of a platform-independent software architecture for video handling, called the Software Video Library (SLIB), as well as some sample applications built on top of the software architecture.

To optimize the SLIB the authors tested how computation intensive different components in playback of compressed video were for M-JPEG, MPEG-1 and INDEO in percent of total computation.

The authors also presents some schemes for measuring the performance and quality of video codecs, i.e. compression ratio, average output bits per pixel and Peak Signal to Noise Ratio (PSNR), and issues affecting the measurement, such as the video contents' color distribution and the amount of motion. They used these measurements on three, not named, video sequences and also measured percentage of CPU usage and the average framerate on four different workstations with software-only - and hardware video rendering. They also compared the performance of compression, as described above, when reading raw video from disk instead of capturing it.

In [9] L. D. Seiler and R. A. Ulichney describes various design alternatives when integrating video rendering into a pair of graphics accelerator chips and presents performance comparisons with software-only image rendering. The parameters measured were the amount of data that the software must process and that must be transmitted over the bus as percent of the capacity, percent of total computation that went to rendering and display and performance in frames per second (fps) for displaying the standard Motion Picture Expert Group (MPEG) flower garden video.

2.7 Important parameters

What are the parameters that are interesting to measure in a videoconferencing system? The system can be roughly divided into a network-, a computation - and a media part. The network performance parameter values are what I'm looking for, the computation performance parameters are boundary values dependent on the system solution chosen and finally the media-related parameters are another set of boundary values determining the end-users' perceived quality.

A somewhat up to date listing of network -, computation and media-specific performance parameters and suggested values, that I have found in literature, is presented in Appendix C.

2.7.1 Network performance

In [2] P. Bagnall, R. Briscoe and A. Poppitt define a classification system for the communication requirements of any large-scale multicast application (LSMA). The resulting taxonomy is intended to be used by an application when sending its requirements to a dynamic protocol adaption mechanism. One of their recommended uses of this taxonomy is for it to be used as a checklist to create a requirements statement for a particular Large Scale Multicast Application.

In [14] F. Fluckiger says that the key characteristics which are most relevant to multimedia applications are the access delay, support for isochronism, type of bandwidth guarantee, and multicast support. A quite up to date survey for LANs is shown in Figure 21.15, for circuit WANs in Figure 23.1,for ATM WANs in Figure 24.8, for Frame Relay & SMDS WANs in Figure 25.10, and for packet WANs in Figure 22.10 in [14].

2.7.2 Computation performance

By computation performance I mean all handling of a signal from the sampling stage until the packet is handed over to the network interface card (NIC) for transmission, and similarly on the receiver side the packet is received from the NIC, processed and finally handed out to the display/playout device.

2.7.3 Media-specific quality parameters

Video and audio have different terminology and different quality parameters both relating to the perceived end-to-end quality as well as to the characteristics of the different compression schemes.

3 Ideas and preliminary results

How to meet the user requirements of a room-based videoconference system? The first step is to identify what the user requirements really is. Here I have a theory that I describe further in section 3.1. The next step is to take what you have and see if it complies with your expectations. This I have done and the result is presented in Appendix B. If it doesn't comply with your expectations, try to fix it. This is what I propose to do.

3.1 What do I mean with high-quality videoconferencing?

In [14] F. Fluckiger defines a set of video- and audio qualities based on the quality of existing audio-video services. His definition of videoconferencing quality consequently is the quality delivered by a ITU-T H.320 codec over a 128 kbps connection (see Appendix C). In H.320, a telephone quality audio stream takes between 4 to 64 kbps depending on which audio codec is used and what's left goes to a H.261 video stream in CIF or QCIF format delivering on average 8 to 12 frames per second. Following the reasoning in section 2.1.3. on mental factors the users will bring with them experience from similar situations. In the case of a room-based video-conferencing system using a television screen to display other participating sites, the user naturally will compare the perceived quality to broadcast TV quality, which is better than telephone quality audio and two or four times the frame resolution at two to three times the frame rate. This comparison will not be in favor for the videoconference system. With high-quality I mean at least the same perceived quality as provided by similar audio-video services, preferably better. Thus I will use the following definitions of high quality audio and video.

Definition of high quality audio:

Low distortion (CD quality, BER < 10-4)
No echo (<=24 ms transit delay or echo canceling)
Low delay (< 100 ms one-way)
Low jitter (< 100 ms)

Definition of high quality video:

High frame rate (25 fps PAL, 30 fps NTSC, 60 fps HDTV)
Large frame size (PAL, NTSC, ITU-R 601, or preferably HDTV)
Sufficient colors (24+ bit pixel depth)
Low delay (< 150 ms TV, < 100 HDTV)
Low jitter (< 100 ms TV, < 50 ms HDTV)
Lowest end-to-end distortion possible (Compression, bit errors, packet loss)

The values given within parantheses are extracted from Appendix C, and may change if I will come across some more enlightening literature. The definitions given here are for perceived quality, and thus cannot be directly translated into network requirements. There is a lot of stuff in between as shown in the next section.

3.2 The prototype

There are few room-to-room systems available that offers the end-to-end services that I want. Therefore I plan to implement a highly specialized room-to-room videoconferencing system supporting high-quality videoconferencing to use as a testing prototype. Since I have no experience in hardware design or DSP programming, I will use ordinary workstations and optimize the overall system - and software architecture for a room-based scenario and as far as possible use the standard audio and video support for the platforms to provide an end-to-end quality that the room-based situation demands. I will use the same machines as for the preliminary work, namely a Sun Ultra 2 Creator 3D and a Silicon Graphics O2. Further I will not implement a whole new system from scratch, but instead modify an existing desktop videoconferencing system consisting of the applications VIC and RAT and turn it into a specialized room-to-room videoconferencing system.

I have identified a few issues related to the implementation of this prototype:

First, to be able to modify a desktop videoconference into a room-to-room dito providing the end-to-end quality as defined above, I have to compare the differences between the room-to-room - and the desktop videoconferencing situations.

Second, most of the hardware support for video and audio that I have seen for different platforms have been optimized for less quality than I want to provide.

Third, the performance of the computers that I have available seems lacking, It seems that software-only compression implies a certain data loss due to lack of resources.

I have also found some suggested solutions to these three issues as presented below which is reflected in the implementation plan in Appendix F.

3.2.1 Differences between the room-to-room - and desktop videoconferencing situation

In the related works section I presented some information on this issue found in literature. Much of that information is in line with my own observations.

The first observation have to do with the supported functionality. The IP-based desktop videoconferencing applications that I have used could easily be used also in a room-to-room scenario. And when doing so we usually ended up with a room-based system offering only a subset of the functionality originally supported by the desktop videoconferencing application. Participant lists and other awareness-related information didn't make it to the TV - or projection screen. Sharing digital documents and workspaces between rooms became somewhat awkward and often resulted in a main camera view of groups of people hanging around computer screens on the participating sites for most of the time. Camera control was not supported by any of the desktop videoconferencing applications that I have tested, although there are several implementations that can be used in combination with the audio- and video tools. Support for transmitting multiple video feeds, e.g. for a document camera in addition to the main camera, had to be implemented using more than one computer or an analog video mixer.

At present I think the only functions that are easily shared by room-to-room - and desktop are the audio and video transmission although the quality requirements are very different. All the other stuff, such as graphical interfaces and window management seem to cost more than it gives in terms of system resources. A shared whiteboard, such as a LiveBoard, could be used to share documents and camera control should be more intuitively integrated into the room to emphasize that it is a resource that is shared between all the receivers.

The second observation has to do with the requirements on perceived quality and transfer effects. A common problem when using desktop videoconferencing applications is that the frame formats supported by the applications have low resolution and gives a low perceived quality when scaled to fit on a TV - or projection screen. The users of desktop videoconferencing systems have shown a tremendous acceptance of lousy quality so far. It may be connected to the user's mental image of the system and it's (lack of) resources or it may be that in the case of new users there's nothing in the user's previous experience that is similar enough to tell him/her what to expect from the system. In a room-to-room videoconference, the user already have a strong expectation of what quality to expect, due to the similarity to TV broadcasts and cinemas. Unfortunately this means that to avoid a bad first impression (that may last) we must aim to provide at least the same video - and audio quality as provided by TV (in the case we use a large screen for display) or cinema (if we choose to use a video projector or backlit screen). The quality I'm talking about here is at least PAL/NTSC or preferably ITU-R 601 for display on a large TV screen and at least HDTV for display on a projector screen or backlit screen. This is the motivation of the definitions of high-quality videoconferencing in 3.1 above.

When it comes to audio, most desktop videoconferencing applications have some form of rudimentary silence suppression. The systems that I have seen all recommend the user to use headphones and sometimes special headsets or even telephone sets are included in the shipment. When extending these applications, as is, to a room-based setting you are bound to end up with loudspeakers and the rudimentary silence suppression in the software will not be sufficient for echo cancellation by far. The need for additional echo cancellation equipment is obvious and the silence suppression features of the applications in general just makes the audio break up.

The last observation has to do with the typical system resources available in the two different cases. Many adverts for room-based system use the principle of pooling of resources as justification for the higher price: A room-based system can serve more people at once, it's easier to protect the components from damage, theft and so forth, and it's easier to service. Therefore it's possible to spend more $ on better equipment and fancy features. Desktop-based systems on the other hand is intended to be used in an ordinary desktop computer and sharing resources with a lot of other activities. It should be easy to use, require minimal resources, it should be possible to run anywhere and it should be cheap to motivate widespread deployment.

3.2.2 Audio-video hardware

Most workstations and PCs can be equipped with hardware specialized for handling audio and video capture and playback. Most video capture cards supports capture of TV quality video. Some capture cards have on-board compression support, often optimized for VCR quality video [7, 35], while others don't. There are also stand-alone compression cards, such as the SGI cosmo compress. The capture cards may have support for scaling and pixel format conversion. On the other side of the system, display cards may or may not have hardware for handling video and so you might need a graphics accelerator card such as the one in [9]. Most standard configurations support graphics display with 8 bits pixel depth, with options for 24 bits. To be able to show graphics with 24 bit pixel depth using 8 bit hardware you need to do color mapping on each pixel, either in hardware or in software, possibly followed by some dithering algorithm to reduce the number of artefacts.

Handling video in the uncompressed domain is very data-intensive. The processing is quite simple and independent of the data, but the amount of data to be processed is huge. That is the reason why VIC delays decompression as far as possible [29] and why pixel operations are often supported in hardware close to capture and graphics display. The graphics can be displayed on a computer screen or sent to an analog out port. Digital and SGI supports analog video out.

Telephone quality audio is supported by most standard configurations and better quality is supported by optional audio cards. SGI supports sampling and playout up to CD quality audio. Most of the information that I have on audio- and video add-on hardware is quite old so I need to do a market survey to find the current status of add-on equipment.

3.2.3 Host limitations for videoconferencing

Because I want to use standard computers, I have to take into account the limitations of such platforms. Preliminary results and a wealth of literature points out the poor real-time characteristics of ordinary desktop computers and workstations. To overcome this I will use a combination of overprovisioning and static resource allocation [15] by distributing parts of the system over a number of independent machines according to functionality as suggested in 3.4.1. I intend to distribute the functionality according to the media separation level 4 described in Appendix D.4 to minimize the amount of time spent in context switching and repairing collisions on the LAN.

As can be seen in Appendix B, a high-end workstation using software-only compression in VIC cannot achieve the full framerate of a worst case video feed even with no local rendering and minimal other load. It seems that delay and data loss depends more on hardware - and system support for video operations than on which combination of compression scheme and quality level is used. I will do some complementary tests to find out more about the distribution of CPU and memory usage in VIC by inserting statistics collection points in the code to measure frame rate, number of skipped frames and the amount of time spent in different parts of the program. The code in VIC is already highly optimized to reduce computation requirements for performance-critical data handling, so I don't expect to be able to contribute much. Instead I think the greatest CPU - and memory usage saving is to be found in removing graphical interfaces, window managers, and whatever else I can find that have minimal use in the context of the functionality that the machine should support. A more detailed implementation plan is described in Appendix F.

3.3 Evaluation

By having a distributed end-system as decribed in Appendix F where each media flows independent of the other, a simpler model of the data path can be used. The model I will use is commonly used in textbooks and papers related to multimedia systems and consists of more or less independent blocks, all contributing to the end-to-end behavior.

A system model for describing the delay contribution of various elements involved in the transfer of continuous data from one point to another is given in Fig 1. End-to-end delay is an important parameter to measure. Another is end-to-end distortion, or data loss. The system model for data loss contribution is similar to the system model for delay. One reason for this similarity is that in a soft real-time system delay may contribute to data loss in the form of data discarded due to real-time constraints violations. A third important parameter is delay variation, also called jitter, that naturally has the same system model as delay. The conclusion is that the model in Fig. 1 is representative for all the important parameters in the definitions of high-quality audio and video in the preceding section.

Fig 1. Delay contribution model

3.3.1 Limitations of the model

We cannot use this model to predict the effects of differences in auxiliary factors such as setup-, appearance- and quality of analog equipment and the influence of non-audio, non-video functionality. As shown in the related works section these factors are important to the perceived end-to-end quality as well, but not relevant to the goal of the proposed work.

The model would also perform poorly if used, as is, to predict the effects of some multipoint communication issues. For example, to model multipoint communication without any transcoding we would need multiple independent copies of the left stack and the propagation boxes. As seen in section 2.2.6 we would need 3 to 8 independent copies. However, this problem can be turned into a pure networking issue. The base for this reasoning is that, if I ignore transportation factors, a signal from one sender is conceptually independent from signals from other senders. Therefore the requirements of a multipoint videoconference can be modeled as a sum the requirements of (3 to 8) * 2 one-way connections as in Fig. 1.

The placement and properties of mixers and transcoders in the network could ofcourse be incorporated as parameters in the propagation box, but the interdependancies and grouping of functionality of such a system wouldn't be visible and thus defeating the whole point of the model. I don't believe this issue is relevant to the work I propose to do.

3.4 Measurements

There are two stages where I need to do measurements; during the optimization of VIC and RAT and a final round to measure the network requirements of the resulting prototype system. To do this I have to find out how much delay, loss and jitter is introduced by each of the boxes above transmission and reception. As can be seen in Appendix C, I have found some typical values in the literature for some of the boxes but some of those values are old and most are platform dependent. The method I will use and the measurements I propose to do is generally describe below, and a more detailed evaluation plan is outlined in Appendix G.

3.4.1 Method

There are many different parameters in the model that one can tweak. Often data is converted many times between capturing and playback. The overhead introduced by these conversions affects both computation performance measurements and quality parameters related to delivery timeliness.

The content of the blocks above transmission and reception in Fig. 1 is largely media-dependent. In a videoconference the main media involved are audio and video, and depending on the level of separation (see Appendix D) these media follows more or less separate paths through the system. Therefore the media-related quality measurements can be split up into an audio-part and a video-part.

3.4.2 Audio measurements

Audio transmission have been around for a long time and consequently the ITU-T has accumulated a wealth of knowledge in this area, both in metrics and measurement methods. In [3, 5, 6, 16] we can find the delay introduced as well as the signal quality distortion (data loss), bit rate and computation complexity of some common codecs. In [5] we can find the recommended packetization delay.

Due to the impressive amount of work done in this area, the main effort from my part will be to interpret existing statistics and reuse them in my model. If time permits I will run some measurements and compare them to the values found in literature.

3.4.3 Video measurements

Digital video communication systems constitute a new and emerging technology that is about to be extensively deployed. The technology is mature, but testing of such systems has been neither formalized, nor extensively studied. However, there is a significant amount of effort in this area and I will use a mix of metrics and measurement methods as they emerge. I will use tools that implements distortion measurement schemes as the ones described in section 2.6.2 to measure end-to-end distortion. Then by changing parameters of the different blocks in the system model, I can measure the data loss and delay contribution of that parameter.

The delay and loss contribution, as well as the bit rate generated by a compression scheme depends heavily on the characteristics of the signal and the complexity of the compression algorithm. Therefore I will run tests with different video sequences and test compression schemes of different complexity. I will use one sequence of white noise as a worst case, one typical videoconference session and I will insert synthetic timestamps in the test sequences like the ones described in [34] to measure end-to-end delay. Another parameter in the coding&compression-block is the compression scheme-dependent quality-levels, for example the Independent JPEG group's 1-100 compression value. Since those levels doesn't follow the same scale, I will test lowest-, default-, and highest quality settings of each compression scheme.

Packet size is another parameter that one can tweak to trade protocol overhead for lesser delay or vice versa.

3.4.4 Limitations of the measurements

To cancel the influence of the network infrastructure, I will connect the two test machines with a single 100 BaseTX Ethernet. To be really sure that this minimal infrastructure doesn't affect measurements, I'll use the mgen [8] tool to measure loss, transmission - and reception delay, and jitter. Propagation delay is assumed to be insignificant.

The measurements suggested in 3.4.2 and 3.4.3 above are clearly platform dependent. To try to get results that are more general, I have to measure the performance of the platform at the time of the test. Maybe also perform measurements on different platforms.

3.5 Summary

This proposal assumes that room-to-room videoconferencing is one of the most demanding future services that a future network infrastructure has to support. It also assumes that the end users of such a system will be humans. Other limitations are the use of standard computers interconnected by an IP-based network. Following these assumptions I propose work to develop a room-based videoconferencing prototype providing a certain perceptual quality derived from work in HCI. The prototype will not be implemented from scratch, but by modifying an existing desktop videoconferencing system consisting of VIC and RAT.

Multipoint scaling issues and multiprogramming overhead is dealt with by using a distributed end system as opposed to an end host. This leaves optimization of the applications.

Another possible outcome of this work is the finding that the defined quality cannot be provided by a prototype developed under the above assumptions for some reason. In that case possible reasons for failure will be investigated and documented as well as the maximum quality that could be achieved.

3.6 Future work

An important part of the proposed work is based upon a theory that transfer effects and association play an important role in the decision to adopt a new medium. That this also is applicable to room-based videoconferencing and that room-based videoconferencing will be associated with TV and cinema is so far unproved.

As shown in the related works section the effects of differences in auxiliary factors such as setup-, appearance- and quality of analog equipment and the influence of non-audio, non-video functionality are important to the perceived end-to-end quality. There is a lot of prototype development efforts in this area, but not so many subjective measurements to compare the trade-offs between different solutions.

There are several interesting issues related to multipoint communications and video-audio applications that I will not handle in this proposed work. For example scaling issues, modeling and the interdependancies and grouping of functionality of such systems as well as the effects of different technical solutions on human communication.

Another question is what happens to H.323 call signalling when the end-systems are distributed? And will using separate multicast addresses for each sender break interoperability vith other RTP-based systems?

4 Action / Time plan / Milestones

The time plan is divided into a study part, an implementation part and an evaluation part. The study part doesn't have to be finished before the implementation part begins, but preferably before the evaluation part begins. The report writing will be a continuous activity spread over the whole time plan.

The preliminary thesis defence is planned June 15. A preliminary thesis should be available for the opponent to comment on at April 15.

4.1 Literature study

The study part will mainly consist of more study on metrics, subjective and objective quality measurement. I also include the market survey to find the current status of add-on equipment.

4.1.1 Literature study on measurement methods and parameters

I have done some work here that is accounted for in 3.4. I need to study more on metrics and do more reading on subjective - and objective quality measurements and what has already been done in this area for videoconferencing. 5 working days.

4.1.2 Host limitations and add-on equipment

I have done some work here that is accounted for in 3.4 and Appendix B. I need to study more on capacity of the chosen platforms. 5 working days.

4.2 System implementation

The implementation part consists of two main actions - some complementary tests on VIC and rat, and the prototype design and implementation.

4.2.1 Performance measurements of the current application

Here I have a lot of data already, some of which are described in Appendix B and some are available in [1, 29, 30]. I'm planning to do some complementary tests. 5 working days.

4.2.2 Suggest enhancements and develop prototype

The first step is to identify deficiencies of the current application and suggest solutions and enhancements. I have done some work here that is accounted for in Appendices B, D and E. 30 working days.

4.3 Evaluation

The evaluation part consists of completing the evaluation plan, doing the testing described in 3.4 and analyze the data. 10 working days.

4.4 Conclusion

Write conclusions and propose future work. 5 working days.

4.5 Writing

The report writing will be a continuous activity spread over the whole time plan. 15 working days.

5 Rough outline of the thesis

The tentative licenciate thesis topic is "A scalable solution for IP multicast based high speed multipoint to multipoint videoconferencing".

Background. The reason why I bother. Including material from sections 1.1, 2, 3 and 6 in this proposal.

Hypothesis. An elaboration on the ideas in section 3.

The tools. Descriptions of tools used and intro to terminology. Including material from sections 2 and 6 in this proposal.

The Application. Presentation of the resulting applications developed during the action described in 4.2.2.

Evaluation model. Presentation of the Evaluation model used in comparing the applications. This model is described in 3.3 and 3.4.

Results. Presentation of test results and discussion on implications and dependancies.

Conclusions. Does my hypothesis hold or not, or just a little, and why?

Future Work. What did I miss? What should be clarified more thoroughly? My suggestion of where we should head from this point. Including material from section 3.6.

6 Terminology and Concepts

This paper contains a lot of terminology from different fields and it's not easy to keep track of all of them. Due to the interdisciplinary nature of this piece of work, there naturally is no fixed terminology and thus every definition in this section may be subject to debate in some way or another. Many of the terms have quite different meaning in different disciplines, and I don't claim to have complete knowledge of all those. Instead, I hope this section will help in avoiding unnecessary misunderstanding when reading this paper.

Access delay. The time necessary at the source to wait for the medium to be available or for the network to be ready to accept the block of information [14].

Analog signals. Physical measures which varies continuously with time and/or space. They can be described by mathematical functions of the type s=f(t), s=f(x, y, z) or s=f(x, y, z, t). A sensor detects a physical phenomenon and transforms it into a measure, usually an electrical current or -tension. The measured values are expressed with an accuracy which is characteristic for the sensor. In signal processing the value is called amplitude [14].

Artifacts. Visible erors which appear unnatural [14].

Aspect ratio. The ratio of the width to the height of a frame. Expressed as X : Y where X and Y are the lowest natural numbers such that X/Y = x/y where x is the width and y is the height of the frame [14].

Bandwidth. A range of frequencies of a transmission media [14].

Bit rate guarantee. The type of guaranteed transmission capacity a network can give to an end-system. Can be either None, Reserved, Allocated or Dedicated [14].

Bitmap. A spatial two-dimensional matrix made up of individual pixels [14].

Burstiness. The degree of bit rate variation of a data stream. Common metrics are peak bit rate (PBR), the peak duration, the mean bit rate (MBR) and the ratio between MBR and PBR [14].

Bus. A single linear cable to which all end-systems are connected. When an end-system transmits a signal it propagates in both directions to all connected end-systems [14].

Chrominance. In analog broadcast television, chrominance signals are constructed by linear combinations of color difference signals [14].

Codec. A system that bundles both the functions of coding and decoding is called a coder-decoder, abbreviated codec [14].

Coding. Associate each quantized value with a group of binary digits, called a code-word [14].

Color difference signal. A color difference signal is formed by substracting the luminance signal from each of the primary color signals [14].

Compression. Compression refers to the algorithms used to reduce the bit rate of a digital signal [14].

Computer. By computer we mean any technology ranging from general desktop computer, to a large scale computer system, a process control system or an embedded system [11].

Computer display scan rate. The frequency at which the screen is refreshed by the electronics of the monitor. Usually in the order of 60 to 70 Hz [14].

Constant bit rate (CBR). Produces bits at a constant rate. Constant bit rate services can be used to carry continuous media and synchronous data [20].

Continuous media. Data are generated at a given, not necessarily fixed, rate independent of the network load and impose a timing relationship between sender and receiver, that is, the data should be delivered to the user with the same rate as it was generated at the receiver [20].

Data alteration. The most frequent alteration concerns inversion of bits, or loss of trailing or heading parts in data blocks or packets. In modern networks, alteration is the least frequent form of error [14].

Data duplication. The same data are received unexpectedly more than once by the receiver. This is a rather rare incident in practice [14].

Data loss. Data may be discarded by network components because of detected data alteration or most frequently due to internal network congestion affecting nodes or transmission lines [14].

Degradation mean opinion score (DMOS). Subjective measures including ratings of perceived quality degradation compared to an original. DMOS uses a five-grade scale ranging from 1, Very annoying, to 5, Inaudible. The DMOS value is extracted from the results of an Degradation Category Rated (DCR) test performed on 20 to 60 untrained persons [21].

Delay equalization. Also called delay compensation. The real-time transmission of continuous media over networks is very sensitive to delay variation. To overcome delay variations, an additional offset delay is inserted at the sink end to achieve a smooth playout. This technique may add a substantial component to the overall latency between the source and the final playout of the sound. In theory the delay offset should match the upper bound of the delay variation, but in practice interactive applications will require an upper bound on the end-to-end delay [14].

Digital signal. A time-dependent or space-dependent sequence of values coded in binary format [14].

Digitization. The transformation of analog signals into digital signals. Consists of Sampling, followed by Quantizing, followed by Coding. Also called encoding [14].

Dithering. Algorithms used to minimize visual artifacts caused by compression and image transformations [14].

Echo. The hearing mechanism normally filters out the echo of one's own voice when speaking. Unfortunately, this filter doesn't work if the echo is delayed long enough [14].

End-to-end delay. I use this as the time between capture/sampling and display/playout of a media.

Error rate. The error rate is a measure of the behaviour of the network with respect to alteration, loss, duplication, or out-of-order delivery of data. Metrics used are the bit error rate (BER), the packet error rate (PER), the cell error rate (CER), the packet loss rate (PLR) and the cell loss rate (CLR) [14].

Flow control waiting time. The time the source has to wait for the network to be ready before being authorized to transmit [14].

Frame. A complete and individual view, and part of a succession of displayed views [14].

Frame rate. The rate at which the frames are displayed in frames per second (fps). Also called temporal resolution [14].

Frame size. The number of pixels per frame. Denoted X * Y where X is the number of pixels per line and Y is the number of lines. Also called the spatial resolution and frame format [14].

Full connection. Every end-system is directly connected with a physical cable to all the others, requiring n2 - n cables to fully interconnect n systems [14].

Image components. A pixel is encoded using either luminance and chrominance signals (YIQ or YUV), luminance and color difference signals (YCrCb) or RGB signals. These building blocks are called image components by a common name.

Interaction. By interaction we mean any communication betweeen a user and computer, be it direct or indirect. Direct interaction involves a dialogue with feedback and control throughout performance of the task. Indirect interaction may involve background or batch processing [11].

Interlacing. Every frame is divided in two fields, the even field consists of the even-numbered lines and the odd field is composed of the odd-numbered lines of the frame. The resolution loss is in the order of one-third compared to progressive scan, but it saves bandwidth in analog broadcast [14].

Intermedia synchronization. Timing relationships between different streams is restored. A typical case of intermedia synchronization is synchronization between audio and motion video [14].

Intramedia synchronization. Timing relationships between elements in a media stream is restored within the individual streams at playout. Also called streaming [14].

Intra-stream dependencies. All compression mechanisms imply that blocks carry some form of updates, so that the data of a block generated at time t carries information affecting blocks generated within an interval [14].

Isochronism. An end-to-end network connection is said to be isochronous if the bit rate over the connection is guaranteed and if the value of the delay variation is guaranteed and small [14].

Lip-synchronization. Intermedia synchronization between audio and motion video [14].

Luminance. The cumulative response of the eye to all the wavelengths contained in a given source of light. Luminance is represented by the function , where is the spectral distribution and is the spectral response. Luminance is usually denoted by Y [14].

Mean opinion score (MOS). Subjective measures including ratings of perceived quality on a five-grade scale ranging from 1, Bad, to 5, Excellent. The MOS value is extracted from the results of an Absolute Category Rated (ACR) test performed on 20 to 60 untrained persons [21].

Medium access time. The time the source system has to wait for the transmission medium to be free [14].

Mesh. A set of interconnected stars with redundant interconnecting links, so that alternative routes exists between two end-systems [14].

Mirror effect. Most users find it disturbing if their captured image is displayed directly without left and right being inverted. People, when viewing their own faces, are accustomed to the mirror effect [14].

Multimedia conferencing. When a conference integrates text, graphics, or images-based conversation with audio or video-based dialog, it is called a multimedia conference [14].

Network connection set-up delay. The time it takes to set up an end-to-end connection between two end-systems. This only applies to those networks which are aware of end-to-end-system connections, such as ATM, ST-II, or satellite-based communications [14].

Network connection failure rate. Covers both the connection attempt rate and the failure rate of ongoing connections. This only applies to those networks which are aware of end-to-end-system connections, such as ATM, ST-II, or satellite-based communications [14].

Nyquist Theorem. The Nyquist classical theory requires that, to faithfully represent an analog signal, the sampling frequency should be equal to or greater than twice the highest frequency contained in the sampled signal. Studies have, however, shown that under certain circumstances, lower sampling frequencies can in practice be used [14].

Out-of-order delivery of data. Long-haul packet networks, in particular, may have alternate routes between two end-systems. When failures or congestion occur, alternate routes may be involved, and route oscillations may happen. As not all routes have the same transit delay, packets may be delivered in a different order than they were emitted [14].

Packet voice conversation. Same as computer-assisted telephony, but the underlying network is packet switched [14].

Perceived resolution. Determined by the frame size, the pixel depth, the frame rate and the subsampling scheme used.

Physical jitter. The variation of the delay generated by the transmission equipment, such as faulty behaviour of the repeater's reshape signals, crosstalk between cables may create interference, electronic oscillators may have phase noise and changes in propagation delay in metallic conductors due to temperature changes [14].

Pixel. Stands for picture element and is the smallest element of resolution of the image. Each pixel is represented by a numerical value, the amplitude. The number of bits available to code an amplitude is called the amplitude depth, pixel depth or color - or chroma resolution. The numerical value may represent black/white in bitonal images, level of gray in grayscale images or color attributes in a color image [14].

Playout. The process of transforming an digital representation of a signal into analog form [14].

Progressive scan. The screen is refreshed progressively line by line, each line being scanned from left to right [14].

Quantization. Converting the sampled values into a signal which can take only a limited number of values [14].

Real-time data. Real-time data imposes an upper bound on the delay between sender and receiver, that is, a message should be received by a particular deadline. Packets that miss their deadline are considered lost (late loss), just as if they had been dropped at a switch or router [20].

Rendering. The technique used for the display of digital still or moving images. Rendering refers to the process of generating device-dependent pixel data from device-independent sampled image data, including dithering [14].

Resolution. One parameter of resolution is the frame size, another parameter is the pixel depth [14].

Return trip delay. The time elapsing between the emission of the first bit of a data block and its reception by the same end-system after the block have been echoed by the destination end-system. Also called round-trip delay [14].

Red-Green-Blue (RGB). The Commission Internationale de l'Eclaire (CIE) has defined a Red-Green-Blue system by reference to three monochromatic colors. The respective wavelengths are Red = 700 nm, Green = 546 nm, and Blue = 436 nm. The television standards have adopted triplets which are generally slightly different from that of the CIE [14].

Ring. A single cable to which all end-systems are connected and with the ends of the cable connected to form a loop. When an end-system transmits a signal it propagates in only one direction [14].

Sampling. Retaining a discrete set of values from an analog signal. Also called capture [14].

Sampling rate. The periodicity of sampling is in general constant and called the sampling frequency or sampling rate [14].

Spectral distribution. Most light sources are composed of a range of wavelengths, each having its own intensity. This is called the spectral distribution of the light source and is represented by the function [14].

Spectral response. How sensitive the human eye is to light of a certain wavelength. The response of the human vision to a wavelength is represented by the function [14].

Star. All end-systems connect to a star point. At the star point, there is a system called a switch which can route the information from one cable to another [14].

Store-and-forward switching delays. Delays caused by internal node congestion [14].

Subsampling. Fewer samples per line is taken, and sometimes fewer lines per frame. The ratio between the sampling frequency of the luminance and the sampling frequency of each of the color difference signals have to be an integer, resulting in all components are sampled at locations extracted from a single grid. The notation is of the type; <Y sampling frequency>:<Cd sampling frequency> for a single color difference scheme, and <Y sampling frequency>:<Cd1 sampling frequency>:<Cd2 sampling frequency> for a luminance-chrominance scheme [14].

Synchronous data. Periodically generated bits, bytes or packets that have to be regenerated with exactly the same period at the receiver. Synchronous data has a constant bit rate [20].

Teleconferencing. Computer-based conferencing at a distance. A generic name for any application which supports real-time bidirectional conversation between two groups or several groups of people. Videoconferencing and shared whiteboards are examples of specific teleconferencing applications [14].

Throughput. The rate at which two ideal end-systems can exchange binary information. Also called bit rate, data rate, transfer rate and bandwidth. The unit is bits per second (bps). In the cases where networks only handle fixed-sized blocks, other units may be used too, e.g. cell rate in ATM networks [14].

Transit delay. The time elapsing between the emission of the first bit of a data block by the transmitting end-system and its reception by the receiving end-system. Also called latency [14]. If the end-systems are connected by a single link, then this is the same as the propagation delay of the medium.

Transit delay variation. The variation over time of the network transit delay. Usually measured as the difference between experienced delay and some target delay for the data flow. Other definitions are based on the difference between the longest and the shortest transit delays observed over a period of time. Also called jitter and delay jitter [14].

Transmission delay. The time necessary to transmit all the bits of a block. For a given block size this only depends on the acess delay [14].

Tree. A set of interconnected stars, so that only one route exists between two end-systems [14].

User. By user we mean an individual user, a group of users working together, or a sequence of users in an organization, each dealing with some part of the task or process. The user is whoever is trying to get the job done using the technology [11].

Video distribution. Traditional broadcast-like, one-way video.

Video format. Consists of resolution, frame rate, aspect ratio and subsampling scheme [14].

Videophony. Telephony with motion video. Videophones may be video-extended telephone sets, so called video dialtones, or a computer equipped with necessary hardware and software [14].

7 References

[1] S. McCanne, V. Jacobson, "vic: A Flexible Framework for Packet Video", CACM, November 1995.
[2] P. Bagnall, B. Briscoe, A. Poppitt, "Taxonomy of Communication Requirements for Large-scale Multicast Applications", Internet Engineering Task Force Request For Comments XXXX, 1998. Work in progress.
[3] International Telecommunication Union Telecommunication Standardization Sector (ITU-T) Recommendation G.114, "Transmission Systems and Media, General Characteristics of International Telephone Connections and International Telephone Circuits, One-Way Transmission Time", ITU-T, February 1996.
[4] K. E. Finn, A. J. Sellen, S. B. Wilbur, "Video-Mediated Communication", Lawrence Erlbaum Associates, Mahwah New Jersey, 1997.
[5] H. Schulzrinne, "RTP Profile for Audio and Video Conferences with Minimal Control", Internet Engineering Task Force Request For Comments 1890, 1996.
[6] S. Quan, C. Mulholland, T. Gallagher, "Packet Voice Networking", CISCO Systems, 1997.
[7] P. Bahl, P. S. Gauthier, R. A. Ulichney, "Software-only Compression, Rendering and Playback of Digital Video", Digital Technical Journal, Vol. 7, No. 4, 1995.
[8] The Naval Research Laboratory (NRL) "Multi-Generator" (MGEN) Toolset. URL: http://manimac.itd.nrl.navy.mil/MGEN/index.html
[9] L. D. Seiler, R. A. Ulichney, "Integrating Video Rendering into Graphics Accelerator Chips", Digital Technical Journal, Vol. 7, No. 4, 1995.
[10] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", Internet Engineering Task Force Request For Comments 1889, 1996.
[11] A. J. Dix, G. D. Abowd, R. Beale, J. E. Finley, "Human-Computer Interaction", 1st Ed., Prentice Hall, 1993.
[12] B. Cain, S. Deering, A. Thyagarajan, "Internet Group Management Protocol, version 3", Internet Engineering Task Force Request For Comments XXXX, 1997. Work in progress.
[13] J. F. Koegel Buford, "Multimedia Systems", Addison-Wesley, 1994.
[14] F. Fluckiger, "Understanding Networked Multimedia", Prentice-Hall, 1995.
[15] U. Schwantag, "An Analysis of the Applicability of RSVP", Diploma Thesis at the Institute of Telematics, Universität Karlsruhe, 1997.
[16] "Introduction to Packet Voice Networking", CISCO Systems, 1997.
[17] S. Deering, "Host Extensions for IP Multicasting", Internet Engineering Task Force Request For Comments 1112, 1989.
[18] T. Maufer, C. Semeria, "Introduction to IP Multicast Routing", Internet Engineering Task Force Request For Comments XXXX, 1997. Work in progress.
[19] T. P. De Miguel, S. Pavon, J. Salvachua, J. Q. Vives, P. L. C. Alonso, J. Fernandez-Amigo, C. Acuna, L. Rodriguez Yamamoto, V. Lagarto, J. Vastos, "ISABEL Experimental Distributed Cooperative Work Application over Broadband Networks", DIT/UPM Espana, 1993.
[20] H. Schulzrinne, "Internet Services: from Electronic Mail to Real-Time Multimedia", Kommunikation in Verteilten Systemen (KIVS) `95, Informatik aktuell Series, Springer Verlag, 1995.
[21] T. Poles, K. Elezaj, "Voice over IP/Ethernet", MSc Thesis, IT98/34, Department of Teleinformatics, Royal Institute of Technology, 1998.
[22] K. Fall, J. Pasquale, and S. McCanne, "Workstation Video Playback Performance with Competitive Process Load", Proceedings of the Fifth International Workshop on Network and OS Support for Digital Audio and Video. April, 1995. Durham, NH.
[23] http://www-nrg.ee.lbl.gov/vic/research.html
[24] T. Turletti, C. Huitema, " Videoconferencing on the Internet", IEEE/ACM Transactions on Networking, vol. 4, no. 3, June 1996.
[25] V. Jacobson, "Multimedia Conferencing on the Internet", SIGCOMM `94 Tutorial, University College London, 1994.
[26] F. Bock, H. Walter, M. Wilde , "A new distortion measure for the assessment of decoded images adapted to human perception", IWISP'96, Manchester, 1996.
[27] C. J. van den Branden Lambrecht, "Perceptual Models and Architectures for Video Coding Applications", PhD Thesis, Ecole polytechnique fédérale de Lausanne, 1996.
[28] I. Busse, B. Deffner, H. Schulzrinne, "Dynamic QoS Control of Multimedia Applications based on RTP", Computer Communications, Jan. 1996.
[29] S. R. McCanne, "Scalable Compression and Transmission of Internet Multicast Video", PhD Thesis, Report No. UCB/CSD-96-928, Computer Science Division University of California Berkeley, 1996.
[30] http://www.it.kth.se/labs/ts/bb/projinfo/victests/victest.txt
[31] http://www.it.kth.se/labs/ts/bb/projinfo/cosmonet/CosmoNet.html
[32] C. Katzeff, K. Skantz, "Talande huvuden och dubbningssjuka", publikation 97:02, Svenska Institutet för Systemutveckling, 1997.
[33] http//mash.cs.berkeley.edu/mash/
[34] A. B. Watson, "Toward a perceptual video quality metric", IS&T/SPIE Conference on Human Vision and Electronic Imaging III, San Jose, California, January 1998.
[35] SunVideo User's Guide, Sun Microsystems Computer Corporation, 1994.
[36] Stephen E. Deering, "Multicast Routing in a Datagram Internetwork", PhD thesis, Stanford University, 1991.
[37] http://www.mdlcorp.com/Insoft/Products/C/C.html
[38] Tim Dorcey, "CU-SeeMee desktop videoconferencing software", ConneXions, 9(3), March, 1995
[39] http://nic.merit.edu/net-research/mbone/.archive.html
[40] G. Karlsson, C. Katzeff, "THE ROLE OF VIDEO QUALITY IN COMPUTER MEDIATED COMMUNICATION", http://www.sisu.se/projects/video/index.html, 199?.

Appendix A. Multipoint techniques

A.1 Point-to-point mesh

Multipoint-to-multipoint service is provided by sender-based copying. This have obvious drawbacks for the senders that have to send multiple identical streams to each receiver.

Fig 2. Point-to-point mesh

A.2 Reflector node

Multipoint-to-multipoint service is offered by having one or more nodes in the session that copies incoming streams from the senders to all the receivers.

If you don't have control over the whole infrastructure, the placement of a reflector node can lead to unnecessary load on parts of the network.

Fig 3. The effect of a misplaced reflector node.

The maximum number of participating sites is bounded by the capacity of the reflector node.

A.3 IP multicast

Multipoint-to-multipoint service is offered by having the routers copying streams where the distribution tree branches. The distribution tree is built upon underlying unicast infrastructure using a multicast routing protocol. The maximum number of participating sites is bounded by the weakest link in the distribution tree.

Fig 4. IP multicast use router-based copying

A sender to a multicast group need not be a member of that group. IGMP v3 allows receivers to specify which senders it wants to listen to and which senders it doesn't want to listen to. Unfortunately IGMP is only used in the steps between sender/receiver and the nearest router, and thus cannot prune unwanted traffic from the whole system.

Fig 5. Drawback with IMGP v3 solution

Appendix B. Host limitations

In this Appendix I present a working draft of my investigation on possible bottlenecks situated in a host that is a part of an end system.

B.1 Supported video feeds

The most commonly supported video feed handled by capture cards and other computer circuitry is quarter sized PAL and quarter sized NTSC. The frame size is only 384x288 pixels for quarter size PAL and 320x240 pixels for quarter size NTSC and the quality provided is approximately the same as for a VCR. A quarter of the size of normal television.

HDTV (High Definition Television) seems a likely video feed in the near future. High-resolution format HDTV has 1920x1080 frame size and 60fps frame-rate. Also, the new H.263+ video compression standard supports frame sizes up to 2048x2048 pixels and MPEG 1&2 supports up to 4096x4096 pixels even though this is never implemented by vendors. ITU-R 601 studio quality video is a format that is accepted by many compression schemes. ITU-R 601 has different frame sizes depending on the source feed. For NTSC feed the frame size is 858x525 pixels and 30fps frame-rate, and for PAL and SECAM the frame size is 864x625 pixels and 25 fps frame-rate. Computer hardware for handling such video feeds is still not commonly available.

B.2 Internal bandwidth limits in the host

Consider that uncompressed non-interlaced 24-bit color VCR-quality PAL demands 384 * 288 * 24 * 25 = 65 Mbps to be copied from the decoder to the frame buffer for each received stream and the same amount from the capture device to the coder for each stream to be transmitted. This whilst receiving and/or sending compressed data over the network and running other programs.

A good 32-bits PCI-bus implementation can transfer sequential data at 80 to 100 Mbytes/s [9]. This means that for a rendering card, as the one described in [9], that cannot handle compressed data, between 8 to 10 % of the bus capacity goes to DMA transfer of the above uncompressed video to the rendering card. So a system with such a PCI bus and rendering card can render up to 7 to 9 of the above VCR-quality video streams. And this is only the rendering part. Add network traffic to and from memory, application programs, compression/decompression processes, swapping, and whatever the operating system might care to do.

In [9] is an example where transmitting thirty 320-by-240-pixel images per second with 16 bit color consumes about 5 % of the capacity of a good 32 bit standard PCI bus implementation. To display this data as 1280-by-960 images with 32 bit color on the screen would use more that 80 % of the bus bandwidth, if the scaling and pixel format conversion occurs in software.

B.3 Encoding - v.s. decoding capacity

Some desktop videoconferencing systems have hardware support for capture and coding, but decoding is done in software. One reason for this could be that the most commonly supported algorithms are asymmetric in the amount of processing needed for compression and decompression [7, 14]. Even though the JPEG algorithm is theoretically symmetric, the performance of the JPEG decoder is better than the encoder due to possible optimizations [7]. The point is that the use of hardware support often means that the videoconferencing system can code more than they can themselves decode. This is especially unfortunate when using the same solution for a high-quality multipoint videoconference, since the node only has to code one stream, but have to decode the streams from all other participants. In this case it seems more reasonable to have hardware support for decoding than for encoding.

B.4 Overload handling

I ran some tests on a couple of platforms that I currently have access to [30], and VIC showed up to a stunning 100% loss in the decoding- and rendering steps. Now, what cause this heavy loss in VIC?

On one of VIC's WWW-pages [23] S. McCanne states that VIC currently processes packets as soon as they arrive and renders frames as soon as a complete frame is received. Under this scheme, when a receiver can't keep up with a high-rate source, quality degrades drastically because packets get dropped indiscriminately at the kernel input buffer. In [22] K. Fall, J. Pasquale and S. McCanne found that when the playout scheduler in VIC gets too far behind (200 ms), it catches up by resetting the frame clock resulting in a burst of lost frames. This could be an explanation for the 100% loss. When the load goes over a certain threshold it will continuously force VIC to be more than 200 ms behind.

So what can you do to avoid getting swamped with data? One way is to drop load that you cannot handle. In [1] S. McCanne and V. Jacobson express their intention to incorporate a load shedding scheme similar to the one described in [22] to allow the application to gracefully adapt to available CPU resources. A large component of the decoding CPU budget goes into rendering frames, whether copying bits to the frame buffer, performing color space conversion, or carrying out a dither [1, 7, 9, 22, 23]. Accordingly, the VIC rendering modules were designed so that their load can be adapted by running the rendering process at a rate lower than the incoming frame rate. E.g. a 10 fps H.261 stream can be rendered, say, at 4 fps if necessary [23]. Alternatively, the decoding process itself can be scaled (if possible). For example, a layered decoder can run faster by processing a reduced number of layers at the cost of lower image quality [23] or the decoder may employ arithmetic approximations for faster decoding at lower quality [22]. While the hooks for scaling the decoder process are in place in VIC, the control algorithm isn't [22, 23].

Another solution is to use layered coding in combination with layered transmission and only subscribe to a quality corresponding to an amount of data that you can handle and thus avoid unnecessarily using up bandwidth [29]. The same effect can also be achieved using RTCP Receiver Reports [10] to force the sender(s) to adjust its transmission rate [24, 28]. Resulting in adjusting the received quality for all participants to the capacity of the least able receiver.

The simplest solution, however, is to use over-provisioning and load distribution on as many machines as needed. This is the solution that I suggest is best suitable to room-to-room videoconferencing.

B.5 Rendering capacity

In most desktop videoconferencing solutions the video that you capture and send is shown on the local screen. According to [1] image rendering sometimes accounts for 50% or more of the execution time. In [7] were found that between 27 - 60 % of the CPU time on a Digital 266 MHz Alphastation with PCI bus where needed for software-only decompression and playback of MPEG-1, MJPEG and Indeo video and that rendering accounts for around one third of the decompression and playback time. Since these papers were written in 1994 -1995 I had to check if this still holds with today's platforms.

In this test I used a Sun Ultra2 Creator 3D with a SunVideo capture card and a Sun camera giving a PAL feed to the capture card. The software used were the UCB/LBL VIdeoConferencing tool (VIC), which is widely used on the Multicast Backbone (MBone). When capturing video and transmitting it in compressed form it is possible to display the captured video. Note that no decompression was done on the captured video, only conversion from PAL to a format suitable for rendering. How this operation is implemented in VIC is not specified in [1], [22] or [23].

After doing some worst case tests I found that the performance of the tool degrades up to 56% on the given platform depending on the size of the rendered picture and the codec used. To check how the quality degrades I incrementally increased the size of the image. The results from the tests are shown in Fig 8 and Fig 9 below.

The codecs tested were H.261 with default(10) - and maximum(1) quality, nv with default(2) - and maximum(0) quality and jpeg with default(30) - and maximum(95) quality. Other codecs supported by the VIC tool was nvdct and cellb, but these were found to give too low subjective picture quality compared to the other codecs to be considered in the test.

Fig 6. Framerate degradation due to local rendering

Fig 7. Transmission rate degradation due to local rendering

I also found that the frame rate degradation varied somewhat depending on which coding standard was used. The deviation between the most degraded - and the least degraded coding scheme was 6% giving a range between 50 % - 56 % maximum framerate degradation. The bit rate degradation varied between 0 % - 61% maximum degradation. The constant bitrate for MJPEG can be explained as due to the hardware support for JPEG compression. That the frame rate is falling also for MJPEG is harder to explain, but is probably a consequence of the overload handling of the rendering part of VIC as reported in [22]. For nv and H.261 the maximum bit rate degradation varied between 56 % - 61 %.

Since the captured video may take a different path from capturing to rendering than video delivered from the decode-part of VIC, one should do a corresponding test where the video is received in coded form from another machine.

As a final note, rendering of the captured video on your local screen is mostly unnessesary, since the image could as easily be taken directly from the camera(s) and shown on a separate display without encumbering the encoder machine.

B.6 Delay contributions

In [3, 5, 6] we can find the delay introduced by some audio coding algorithms used in telecommunications today and in [5] we can find the recommended RTP packetization delay. I have no reference to delay introduced by video codecs, but many references in [4, 32] that suggest that the video codec delay is so much larger than the audio codec delay that if you delay the audio to keep audio and video synchronised, it will severely affect the intelligibility of a conversation. According to [3] this means delays larger than 400 ms end-to-end.

To cope with jitter and packet reordering on IP-based networks, the receiving applications usually employ playout buffers, which introduces delay. The playout buffer can be fixed size, or can adjust it's size to current jitter measurements [20], acceptable late loss [20], one-way trip time estimations, acceptable total loss, and the mean packet size so far. According to [22] the frames may be scheduled for rendering with intervals between less than 33 ms up to 200 ms, setting an upper bound on the jitter tolerance from the decoding step in VIC.

In spite of light workload some sender hosts of realtime traffic have been found to cause significant delay variation due to non-real-time properties of some operating systems. Therefore the host would need to run a real-time operating system to be able to provide a predictable QoS [15]. This doesn't mean that using a real-time operating system ensures timely playout, only that to be able to support predictions you need a real-time operating system.

Appendix C. System parameters

In this Appendix I present a listing of system parameters and their typical or recommended values found in literature.

C.1 Network parameters

C.1.1 Delay

Access delay
Connection set-up delay
Flow control waiting time
Medium access time
Return trip delay
Store-and-forward switching delays
Transit delay 0 to 150 ms transit delay is acceptable for most user applications, 150 to 400 ms is acceptable provided that Administrations are aware of the impact on the quality of user applications (for example international connections with satellite hops), above 400 ms acceptable in some exceptional cases (double satellite hops, videotelephony over satellite, temporary fixes) [3]. The ITU-TS has defined 24 ms as the upper limit of the one-way transit delay beyond which additional echo canceling techniques have to be employed [14].
Transmission delay

C.1.2 Data loss

Connection failure rate
Data alteration
Data loss
Error rate Optical fiber transmission systems have a BER less than 10-9. Satellite digital circuits have a BER between 10-6 and 10-8. A bit error rate of 10-9 will on average affect 1 frame per 3 hours in a 128 kbps video stream, 1 frame every 4 minutes in a 3 Mbps video stream and 1 frame per minute in a 20 Mbps video stream [14].

C.1.3 Delay variation (jitter)

Out-of-order delivery of data
Physical jitter Over long-distance circuits its typical value is in the order of microseconds. Over high-speed optical-based technologies a value of about 6 nanoseconds is targeted [14].
Transit delay variation With typical personal computers and workstations, the variation of the network transit delay should not exceed 100 ms for CD quality compressed sound and 400 ms for telephone quality speech. For HDTV quality video less than 50 ms, for broadcast TV quality less than 100 ms and for VHS quality less than 400 ms [14].

C.1.4 Data rate

Data duplication
Throughput
Bandwidth guarantee
Isochronism

C.2 Computation performance

Computer display scan rate The frequency at which the screen is refreshed by the electronics of the monitor. Usually in the order of 60 to 70 Hz [14].
Bus bandwidth requirements Transmitting thirty 320-by-240-pixel images per second with 16 bit color consumes about 5 % of the capacity of a good 32 bit standard PCI bus implementation. To display this data as 1280-by-960 images with 32 bit color on the screen would use more that 80 % of the bus bandwidth, if the scaling and pixel format conversion occurs in software [9].
Inter-application computation requirements According to [1] image rendering sometimes accounts for 50% or more of the execution time.
Inter-frame Display Time (IDT) Inter-frame Display Time (IDT) should be constant with very low variance (jitter) to maintain smooth media playback. IDT variance is strongly correlated to the level of compute-bound competitive load [22].
Inter-machine computation requirements For playback of a Source Input Format (SIF - 352 by 240 pixels) M-JPEG video on a 266 MHz Alphastation with PCI bus and hardware video rendering support using 40 percent of CPU. With software-only rendering 60 percent of CPU was used. For playback of a SIF MPEG-1 video on the same platform and 38 - and 58 percent of CPU was used and for SIF INDEO video and only 21 - and 27 percent of CPU was needed [7].
Inter-scheme computation requirements For M-JPEG, MPEG-1 and INDEO, 35 to 40 percent of total computation went to rendering and display for each codec, while the distribution of the remaining 60 to 65 percent could not be compared since the codecs have different components. Inverse Discrete Cosine Transform (IDCT) in MJPEG and motion compensation in MPEG-1 was the second most computation intensive components [7].
Operating system overhead In [7] P. Bahl, P. S. Gauthier and R. A. Ulichney compared the performance of compression when reading raw video from disk instead of capturing it, and found that they got a 5 percent performance increase for M-JPEG, and 33 percent increase for INDEO. The authors suggest this increase depends on overhead resulting from context switching in the operating system and the scheduling of sequential capture operations by the applications.

C.3 Media quality parameters

Audio signal quality In [14] F. Fluckiger defines three typical levels of sound quality: Telephone quality (G.711, 3.4 kHz bandwidth, 8 kHz sampling rate), CD-quality (monophonic 20 kHz bandwidth, 44 kHz sampling rate), Sound studio quality (monophonic 40 kHz bandwidth, 80+ kHz sampling rate).
Bit rate HDTV quality video compressed with MPEG-2 requires 25 to 34 Mbps. Studio quality video compressed with MPEG-2 requires 3 to 6 Mbps, with existing JPEG-based products 8 - 10 Mbps. Broadcast quality video compressed with MPEG-2 requires 2 to 4 Mbps, with existing JPEG-based products 6 to 8 Mbps. VCR quality video compressed with MPEG-1 requires 1.2 Mbps. Videoconferencing quality video compressed with H.261 requires 100 kbps. Telephone quality audio requires between 4 - 64 kbps, CD quality audio requires between 192 - 1411 kbps [14].
Bit error rate In the case of presentation to human users the bit error rate of a telephone-quality audio stream should be lower than 10-2. The bit error rate of a CD-quality audio stream should be lower than 10-3 in the case of an uncompressed format and lower than 10-4 in the case of a compressed format. The bit error rate of compressed video streams of HDTV quality video should not exceed 10-10, for broadcast TV quality it should be lower than 10-9 and for VCR quality less than 10-8. If forward error correction (FEC) is used the rates given can be increased by a factor of 10 000 [14].
Burstiness The burstiness ratio of a compressed video stream may reach 10 to 1 [14].
Compression delay The delays introduced by the end-systems, to compress and to decompress the audio and video streams, are very significant, in the order of 1 or even several seconds [14]. The delays introduced by telecommunication hardware for coding and decoding is typically: PCM (G.712) codec 0.75 ms, ADPCM (G.721, G.726, G.727) codec 0.25 ms, LD-CELP (G.728) codec 2.0 ms, CS-ACELP (G.729 8 kbps) codec 15 - 30 ms, H.260-series in the order of several hundred milliseconds [3].
Delay Highly dependant on application area. Voice conversation demands between 100 to 500 ms one-way. The impression of presence is extremely sensitive to delay in reaction to user inputs. The total elapsed time between a user action and the sensory feedback should generally be less than 100 ms [14]. G. Karlsson and C. Katzeff states that the video and sound may be delayed 150 ms without causing much disturbance [40].
Frame rate At framerates lower than 10 fps, the impression is more that of a succession of individual images. Between 10 to 16 fps, the viewer has an impression of motion but will still feel a jerky effekt. Above 15 or 16 fps the impression of smooth motion begins. Increasing the frame rate beyond 16 fps will progressively improve the comfort of viewing, particularly for rapid movements and significant changes between images [14].
Inter-media synchronization The video may antecede the sound by up to 100 ms (or succeed it by less than 10 ms) [40]. The difference between the playout of the sound and the display of the video should not exceed 100 ms [14].
Jitter Since jitter can be eliminated by delay equalization, an applications jitter tolerance depends on the maximum end-to-end delay that the particular application tolerates and the buffering capability of the receiving system [14].
Number of artifacts A quality judged "good" by a viewer requires that no more than one frame be affected every 4 minutes in broadcast TV quality and every 10 minutes for HDTV quality [14].
Persistence of artifacts
Motion search buffering Several frames have to be buffered at the source before being transmitted to search for redundancies between them, which increases the overall delay. For example, if five frames have to be buffered on average, the buffering will arithmetically generate an additional 200 ms delay in the PAL or SECAM standard [14].
Video signal quality In [14] F. Fluckiger defines five classes of video quality: HDTV quality (1920*1080, 24+ bit color, 60 fps, 16:9 aspect ratio), Studio quality (CCIR-601, 4:3 aspect ratio), Broadcast TV quality (PAL/NTSC, 4:3 aspect ratio), VCR quality (half PAL/NTSC, 4:3 aspect ratio), Videoconference quality (low-speed videoconferencing - 128 kbps, CIF, 5-10 fps).

Appendix D. Levels of flow separation

In this Appendix I present a few different "levels" of separation of the incoming and outgoing traffic flows to a certain end-system. The levels are not completely orthogonal, but allows for a fair number of combinations.

D.1 Separate transport streams

Most videoconferencing applications use separate transport streams for audio, video and data. This allows for treating each stream independently, which can be useful to allow for prioritization of audio over video in case of lack of bandwidth or system resources. The drawback is that you lose inter-stream synchronisation and introduce more protocol overhead for the separate transport streams.

D.2 Separate applications

Some videoconferencing systems use completely different applications for audio and video to allow the user to decide how much he want to reveal of himself. In this case, to prioritize one media type before another, one have to rely on some lower level support, for example network-based QoS and the multitasking algorithms in the end hosts' operating system. VIC [1] is one example of this solution.

D.3 Separater sender - and receiver parts of an end-system

Next step on this scale might be to separate the sender part of a videoconferencing application from the receiver part of the application and locate these on different hosts. In this way we relieve the sender machine of the burden of decoding and relieve the receiver machine of the burden of encoding. One drawback is that in spite of light workload some sender hosts of realtime traffic have been found to cause significant delay variation due to non-real-time properties of some operating systems [15].

D.4 Separate network connections

The second last step is to have separate links between the sender/receiver and the nearest router. Based upon the assumption that a router is better suited to handle packet filtering than a typical host. One example of a videoconferencing system implementing this solution is the CosmoNet [31]. In [15] U. Schwantag states that in many cases it is impossible to give end-to-end guarantees because the LANs of the sender and the receiver don't support guaranteed QoS. To have separate links between the sender/receiver and the nearest router is a combination of over-provisioning and static resource allocation as defined in [15]. Another solution would be to use a LAN technology that supports QoS guarantees, e.g. IsoEthernet or ATM. In the case of Ethernet this grade of separation gives a significantly better utilization of the link between the sender/receiver and the nearest router since the main delay contribution is due to collisions [21]. By having only one sender on the Ethernet segment this delay contribution is expected to disappear.

The main drawback with having separate sender - and receiver links is that it will introduce acoustic loopback problems in combination with IP multicast and may preclude some echo cancellation methods. One example of such a problem is the problem with local transmission echo as described in Appendix E.

D.5 Separate networks

The last step is to have separate links between all senders and receivers. Analog video systems is one example of this solution [4]. The obvious benefit here is that we will have optimal transmission characteristics from each sender to each receiver, while the obvious drawback is the scalability problems inherent in a mesh topology.

Appendix E. Local transmission echo

All of the videoconferencing applications that I have evaluated assumes that the origins and sinks of the transport streams for a certain media and for a certain participant coincide at the same end host. There is also a connection-oriented thinking in the design of the standards used today, such as RTP and H.32x.

In RTP, identification of different streams is done through a Synchronisation Source identifier (SSRC) that is unique within the session and randomly generated. Through the RTCP Source Description (SDES) message this SSRC identifier can be connected to a Canonical Endpoint identifier (CNAME) that uniquely identifies a participant within all sessions. The CNAME identifier "should" have the format user@host, or just host.Two examples given in [10] are doe@sleepy.megacorp.com and doe@192.0.2.89.

Normally, the network interface card doesn't echo back the data it has just transmitted. This helps to avoid local echo when using the same network interface card for transmission and reception. Now, if you use separate network interface cards for transmission and reception, then you need some other mechanism to filter out your own data.

One way to solve this could be to filter out received data with the same IP address as any of the interfaces in the host. However, this won't work in multipoint communication using IP multicast, where all senders send to a certain D-address and all receivers receive from that same address.

Another solution is to use the SSRC identifier in RTP to filter out data from the same host, but this would only work after a RTCP SDES-message has been issued by the application and received again. The problem here is that the RTCP-traffic is recommended to take only 5% of the total session and the interval between RTCP packets is required to be more than 5 seconds. This means that if a collision of SSRC-identifiers should occur, and both the senders choose new random SSRC's, it would take at least 5 seconds to reinitialize the filter. Meanwhile the user would be harassed by an annoying echo.

You could also use a different multicast address for each user. Doing this you lose interoperability with common MBone sessions where normally one multicast address per media type is shared by all participants. There is also a risk that the network interface card (NIC) will run out of multicast address filtering slots and then the filtering would be carried out in the OS kernel.

A common drawback with the above solutions is that the filtering end up being done in the software and thereby unneccessarily competing for resources with other processes. This may not be a problem at all when using completely different links to the nearest router from the sender and the receiver in combination with the above solution with multiple multicast addresses or in combination with using the Internet Group Management Protocol (IGMP) v3 [12] for IP multicast group management where the receiver can specify the IP addresses of the sources it wants to receive data from and which ones it doesn't want to see data from. In these cases all filtering will be done in the router and will not encumber the end host with unnecessary processing as well as conserve bandwidth on the links to the sender and the receiver.

Appendix F. Implementation plan

F.1 The end-system

There are several reasons why I decided to use an end-system like the one in Fig 8 instead of an end-host.

The router is better at multiplexing and demultiplexing than the hosts. In [15] U. Schwantag states that in many cases it is impossible to give end-to-end guarantees because the LANs of the sender and the receiver don't support guaranteed QoS. In the case of Ethernet the main delay contribution is due to collisions [21], therefore having only one sender and one receiver on each segment should minimize this delay contribution. Since the audio, video and document handling parts of a videoconference are independent of each others, distributing functionality over a cluster of machines should minimize the amount of resources spent on context switching and other multiprogramming overhead in the end-hosts. The sender- and receiver parts of the audio- and video functions are independent as well, allowing for further specialization into separate sender- and receiver machines. This supports the one-way directed link described in item 2 above as well as further optimizations of the applications as described below. A distributed design as described in the two preceding items allows for an arbitrary number of receiver machines to cope with scaling problems related to the number of additional end-systems sending traffic. In short this design minimize the multiprogramming overhead and the influence of multipoint conferencing on scalability.

Enabling technologies are

IP multicast allows sending to a multicast address without having to join the distribution tree. This enables item 2 above.
IGMP v.3 allows for joining a subset of the sources in a multicast group. This enables item 5 above.

Fig 8. Proposed end-system design

F.2 The Sender Machine

The sender machine includes the main building blocks as depicted in Fig 9. These building blocks are the same for both audio and video. The contents of the blocks are dependent on the media, however.

Fig 9. Sender machine main blocks

An ideal sender machine would not include anything but the functionality shown in Fig 9, but unfortunately a typical desktop computer includes a lot of other processes. All the other junk that are running on the same machine is called competitive load and should be minimized. If looking at my workstation there is a lot of competitive load (Fig 10) that are more or less automatically run and that take a certain amount of resources (Fig 11). By minimizing the number of processes I minimize the overhead for context-switching, paging and other operating system overhead related to process-handling.

Fig 10. Example of Competitive load using ps

Fig 11. Competitive load resource demand using top

If I were to optimize my machine for supporting only the functionality in Fig 9, I could free up to about 19 MByte memory and 13 percent CPU time used for window management, X display management and sendmail. I guess Sun's X Imaging Library demands that the X server is running to perform scaling and compression operations, but I have to test to be sure. If not, then I can free an additional 5.7 MByte and 30 percent CPU time. Thus in an audio sender machine, where I don't need any graphics, I can free up to a total 24.7 MByte of memory and 43 percent CPU time.

F.3 The Sender Application

The applications that I will modify are called VIC and RAT, and handles functionality in the two middle boxes in Fig 9 and also part of the arrow into these boxes. The scope of the applications is shown in Fig 12.

Fig 12. The scope of the sender application

The current version of VIC incorporates both the sender part and the receiver part in one program. The source code consists of a number of C++ objects handling the data-intensive parts together with a corresponding set of Tcl/Tk procedures allowing for rapid prototyping. This is a good trade-off when prototyping new CSCW-applications since the appearance can easily be changed and new functionality can easily be "glued" into the program using Tcl/Tk scripts. This combination have been further generalized in the MASH project [33].

As seen in Appendix B, a high-end workstation using software-only compression in VIC cannot achieve the full framerate of a worst case video feed even with no local rendering and a competitive load comparable to the one described in Fig 11.

I will do some complementary tests to find out more about the distribution of CPU and memory usage in VIC as well as the time spent in different parts of the program. The code in VIC is already highly optimized to reduce computation requirements for performance-critical data handling, so I don't expect to be able to contribute much there. Instead I think the greatest CPU - and memory usage saving is to be found in splitting VIC into two separate programs, one sender part and one receiver part, and peel away all but the necessary code from the sender part to support the functionality in Fig 12. Examples of code worth removing are graphical interfaces, awareness-related functionality such as members lists and other features based on RTCP receiver reports. Hopefully this will result in a smaller code size with more localized execution resulting in more cache-hits and faster execution.

The amount of resources used for handling audio is significantly less than for video, so I will not spend any time on optimizing RAT for using less resources, instead I will concentrate on minimizing delay contribution. Preliminary subjective observations using headphones, a microphone and running RAT on the loopback interface indicate a significant delay is introduced in the sender- and receiver parts of RAT. I will do some complementary tests to find out more about the delay contribution of different parts of the program by inserting statistics collection points in the code.

I will also try splitting RAT into two separate programs, sender and receiver, and see if any significant decrease in delay can be achieved. As in VIC I will peel away all but the necessary code from the sender part to support the functionality in Fig 12 to get a smaller program size and a more localized and thus faster execution.

The problem with local echo have been identified and solutions have been proposed in Appendix E.

F.4 The Receiver Machine

The receiver machine includes the main building blocks as depicted in Fig 13. These building blocks are the same for both audio and video. Like the sender machine the contents of the blocks are dependent on the media.

Fig 13. Receiver machine main blocks

As for the sender machine described above, the competitive load in the receiving machine should be minimized. Also similar to the sender case, in the audio receiving machine, no X or graphics is needed, but in the video receiver machine the rendering functions used by the video receiver application needs an X server to be running.

A new problem here is how to handle received streams from different senders. In the audio receiving machine all incoming audio information is multiplexed into one outgoing signal according to sampling time, but if you try to do this with incoming video information the outgoing signal will be incomprehensible. Instead you have to either scale the incoming video streams and then combine them into an outgoing signal, as in Fig 14, or you send the incoming video information to different outgoing interfaces as in Fig 15. The overhead introduced by constructing a combined outgoing signal in software - and the price in image quality distortion is too high for this to be a valid solution for handling multiple incoming video streams. If this is the preferred output format one can use an analog video mixer.

If many incoming streams are received, the amount of incoming data can become too much for a single receiver machine to handle or it can run out of outgoing interfaces or it can reach the limit for how many times an outgoing signal can be split. To cope with this problem one can use multiple receiver machines and distribute the streams over the machines using IGMP v.3.

Fig 14. Combined outgoing signal

Fig 15. Separate interfaces

F.5 The Receiver Application

The scope of the receiving parts of VIC and RAT is shown in Fig 9. RTP datagrams are received from the TCP/IP stack, the data is decoded and queued for playout or rendering in a timely manner.

Fig 16. The scope of the receiver application

As shown in Appendix B, the main consumer of system resources in VIC is rendering while in most video compression schemes decompression is much easier than compression. Therefore I think that in VIC, most of the possible optimization is to be found in the Signal reconstruction box in Fig 16. As for the sender part of VIC, I will peel away all but the necessary code from the sender part to support the functionality in Fig 16. Examples of code worth removing are, like in the sender case, graphical interfaces and awareness-related functionality such as members lists and other features based on RTCP receiver reports. Again I hope this will result in a smaller code size with more localized execution resulting in more cache-hits and faster execution. I will also modify VIC to send directly to the video-out interface, thereby eliminating the extra display on screen followed by screen capture used today.

As for the sender part of RAT, I will not spend any time on optimizing RAT for using less resources, but rather concentrate on minimizing delay contribution. The delay contribution of different parts of the program will be investigated by inserting statistics collection points in the code and I will peel away all but the code necessary to support the functionality in Fig 16 to get a smaller program size and a more localized and thus faster execution.

F.6 Additional modifications

The internal frame formats supported in VIC are too small. For example using H.261, all incoming images are scaled to CIF format before compression and after decompression the images can be scaled to any of a set of supported formats for display on the receiving size. These "unneccessary" scaling operations takes time and introduces a lot of signal distortion. I will investigate the possibility to use larger frame formats (as supported by the compression schemes) internally to reduce the distortion and if possible also eliminate the scaling operations.

Appendix G. Evaluation methodology

As said in section 4.2.1, I need to do some complementary tests of VIC as part of the prototype development to find bottlenecks in VIC and in the platform used. The evaluation plan below describes how I intend to perform these evaluations. When the prototype have shown to be able to deliver high quality audio and video, a somewhat simpler evaluation, treating the sender- and receiver parts as black boxes, will be performed to determine the requirements on the network part.

The end-to-end metrics that I have found important in a high-quality videoconference system are delay, delay variation and signal distortion. The end-to-end factors that I can vary are video - and audio signal characteristics, compression scheme, compression quality and hardware- and software configuration of the sender- and receiver machines.

When trying to construct an evaluation plan for the whole system I found the complexity overwhelming. The system consists of three subsystems, sender, network and receiver, with slightly different parameter sets. Therefore, I decided to evaluate the sender- and receiver parts separately. When determining the network requirement there is yet another important metric - data rate. To determine the bounds of this metric I only have to measure the data rate produced by the sender.

The different performance evaluations thus will be

Delay and delay variation in sender
Delay and delay variation in receiver
Distortion in sender
Distortion in receiver
Data rate produced by sender

Evaluation plans for those test runs will be described below according to the 10 recommended steps for a performance evaluation study as can be found in R. Jain: The Art of Computer Systems Performance Analysis.

Currently I only have a detailed plan for evaluating delay and delay variation in the sender part of VIC as shown below. Evaluation plans for the other four parts will follow the same outline.

G.1 Delay in sender

The goal of the study is to find the distribution of the time spent in the sender part of a video transmission system.

The system can be described as a flow of data from a capture point through a number of transformations to a transmission point.

The system boundaries are the arrow from analog and the arrow from transmission.

Administrative boundaries: No control over hardware and low-level software components. Only software compression and digital signal processing considered.

G.2 List services and outcomes

The services are:

Sampling & digitization
Coding & compression
Packetization
Transmission

G.2.1 Done correctly

Sampling & digitization.

Time: Time between analog signal in -> digital signal out
Productivity: Samples per time unit
Resource: CPU, Memory and Bus utilization

Coding & compression

Time: Time between first bit of frame in -> last bit of compressed frame out
Productivity: Frames per time unit
Resource: CPU, Memory and Bus utilization

Packetization

Time: Time between start of new packet
Productivity: Packets built per time unit
Resource: CPU, Memory and Bus utilization

Transmission

Time: Time between received first bit of packet -> successful transmission of last bit of packet
Productivity: Packets transmitted per time unit
Resource: CPU, Memory and Bus utilization

G.2.2 Done incorrectly

Not applicable

G.2.3 Cannot do

The sender cannot deliver. Duration. Time between failure.

hardware failure
OS failure
application failure

The receiver cannot receive. Duration. Time between failure.

hardware failure
OS failure
application failure

G.2.4 Select Metrics

Speed

Delay between Sampling & digitization and Coding & compression. Time between request for frame -> first bit of frame delivered.
Delay in Coding & compression. Time between first bit of frame in -> last bit of compressed frame out.
Delay in Packetization. Time between start of new packet.
Delay in Transmission. Time between received first bit of packet -> successful transmission of last bit of packet.
Frame rate received at receiver.
Propagation delay in network.
Reception delay in receiver.
Total delay. Time between analog signal in -> reception at receiver

Accuracy not applicable

Availability. Duration and time between failure of

Hardware
OS
Application

G.3 List Parameters

System parameters

Speed of CPU
Speed of Memory
Bus bandwidth
Compression standard
Compression quality
Packet size
Network bandwidth

Workload parameters

Analog video standard
Amount of motion in clip
Color distribution in clip
Competitive load - CPU
Competitive load - Memory
Competitive load - Bus

G.4 Select Factors to Study

Compression standard.

NV
H.261
H.263
MPEG-1

Compression quality. Compression scheme dependent quality level.

Lowest
Default
Highest

Packet size.

Range

Amount of motion in clip.

Low
Typical
High

Color distribution in clip. Local variance is

Low
Typical
High

G.5 Select Evaluation technique

This performance evaluation follows a very simplified analytical modeling and a preliminary measurement that was used to obtain the parameters and factors and define suitable system boundaries. Therefore I will do measurement only. The evaluation goal is deemed not suitable for simulation.

G.6 Select Workload

The competitive load will be minimized. Five analog video clips in PAL format will be fed to the system. These video clips will be have different characteristics as shown in the table below..

Table 1: Video clips

Motion\\ Color	Low	Typical	High
Low	Black	Background
Typical		Meeting
High		Entrance	White noise

Overhead caused by the network and receiver will be measured using MGEN.

G.7 Design experiments

Video processing takes a lot of system resources. If the scarcity of system resources significantly affects the measurements, then the evaluation should be aborted. For this I will use the frame rate received at receiver. I will measure frame rate for all combinations of compression standards and -qualities and video clips. If, for a certain combination, the frame rate is significantly lower than 25 fps, the system resources is lacking and all experiments with this combination will be aborted. If more than 50% of the combinations is aborted, then the evaluation is aborted.

Then I will determine network- and receiver overhead for different packet sizes. The packet size causing the least average propagation- and receiver delay will then be used in subsequent experiments.

Next I will measure total delay for all combinations of compression standards and -qualities, for each of the five video clips, for five rounds resulting in 175 experiments. If variance is too high, then five more rounds will be conducted. If still too high after that, the evaluation is aborted.

Three combinations will be chosen for monitoring the remaining metrics (Delay in...); the combinations resulting in lowest total delay, highest total delay and median total delay. I will also measure total delay during the experiments to determine the overhead caused by the monitors.

Last I will measure sustained frame rate and burst size distribution at the receiver for the three combinations.

G.8 Analyze and Interpret Data

Delay between sampling & digitization and coding & compression is expected to have low variability. Therefore a mean value will be used.

Delay in coding & compression is difficult to predict but depends on combinations of video clip, compression scheme and compression quality. I will try to fit the values to a distribution and compute confidence interval.

Delay in Packetization depend on the packet size and the bit rate produced by the compression scheme. Since the bit rate is variable while the packet size is constant, the distribution of Delay in Packetization will follow the bit rate distribution. Maybe also show similarities to the distribution of Delay in coding & compression. I'll check that if time allows.

Delay in Transmission will probably be very small and with low variability. However, I have seen strange behaviour in this part before so if time allows a mean value together with standard deviation would allow me to sleep well at night.

Propagation delay in network and the Reception delay in receiver is expected to be small and with low variability. Therefore a mean value will be used.

Total delay will probably vary a lot for different combinations of compression standards and -qualities and video clips. Mean values and standard deviation for each combination over all runs should do the trick. The C.O.V. should also be considered.

G.9 Present Results

Most of the data will be plottable and some, such as delay in Coding & compression as well as Packetization are expected to show similarities. The most interesting results are total delay minus propagation- and reception delay and should be plotted as a function of the combinations of compression standards and -qualities and video clips.

Maintained by Tobias Öbrink