Read, Watch, Listen: A commentary on eye tracking and moving images – Tim J. Smith


Eye tracking is a research tool with great potential for advancing our understanding of how we watch movies. Questions such as how differences in the movie influence where we look, and how individual differences between viewers alter what we see, can be operationalised and empirically tested using a variety of eye tracking measures. This special issue collects together an inspiring interdisciplinary range of opinions on what eye tracking can (and cannot) bring to film and television studies and practice. In this article I will reflect on each of these contributions with specific focus on three aspects: how subtitling and digital effects can reinvigorate visual attention, how audio can guide and alter our visual experience of film, and how methodological, theoretical and statistical considerations are paramount when trying to derive conclusions from eye tracking data.



I have been obsessed with how people watch movies since I was a child. All you have to do is turn and look at an audience member’s face at the movies or at home in front of the TV to see the power the medium holds over them. We sit enraptured, transfixed and immersed in the sensory patterns of light and sound projected back at us from the screen. As our physical activity diminishes, our mental activity takes over. We piece together minimal audiovisual cues to perceive rich otherworldly spaces, believable characters and complex narratives that engage us mentally and move us emotionally. As I progressed through my education in Cognitive Science and Psychology, I was struck by how little science understood about cinema and the mechanisms filmmakers used to create this powerful experience.[i] Reading the film literature, listening to filmmakers discuss their craft and excavating gems of their craft knowledge, I started to realise that film was a medium ripe for psychological investigation. The empirical study of film would further our understanding of how films work and how we experience them, but it would also serve as a test bed for investigating complex aspects of real-world cognition that were often considered beyond the realms of experimentation. As I (Smith, Levin & Cutting, 2010) and others (Anderson, 2006) have argued elsewhere, film evolved to “piggy back” on normal cognitive development and use basic cognitive tendencies such as attentional preferences, theory of mind, empathy and narrative structuring of memory to make the perception of film as enjoyable and effortless as possible. By investigating film cognition we can, in turn, advance our understanding of general cognition. But to do so we need to step outside of traditional disciplinary boundaries concerning the study of film and approach the topic from an interdisciplinary perspective. This special issue represents a highly commendable attempt to do just that.

By bringing together psychologists, film theorists, philosophers, vision scientists, neuroscientists and screenwriters, this special issue (and the Melbourne research group that most contributors belong to) provides a unique perspective on film viewing. The authors included in this special issue share my passion for understanding the relationship between viewers and film, but this interest manifests in very different ways depending on their perspectives (see Redmond, Sita and Vincs, this issue, for a similar personal journey into eye tracking to that presented above). By focussing on viewer eye movements, the articles in this special issue provide readers from a range of disciplines a way into the eye tracking investigation of film viewing. Eye tracking (as comprehensively introduced and discussed by Dyer and Pink, this issue) is a powerful tool for quantifying a viewer’s experience of a film, comparing viewing behaviour across different viewing conditions and groups, and testing hypotheses about how certain cinematic techniques impact where we look. But, as is rightly highlighted by several of the authors in this special issue, eye tracking is not a panacea for all questions about film spectatorship.

Like all experimental techniques it can only measure a limited range of psychological states and behaviours, and the data it produces does not say anything in and of itself. Data requires interpretation. Interpretation can take many forms[ii] but if conclusions are to be drawn about how the data relates to psychological states of the viewer, this interpretation must be based on theories of psychology and ideally confirmed using secondary/supporting measures. For example, the affective experience of a movie is a critical aspect which cognitive approaches to film are often wrongly accused of ignoring. Although cognitive approaches to film often focus on how we comprehend narratives (Magliano and Zacks, 2011), attend to the image (Smith, 2013) or follow formal patterns within a film (Cutting, DeLong and Nothelfer, 2010), several cognitivists have focussed in depth on emotional aspects (see the work of Carl Plantinga, Torben Grodal or Murray Smith). Eye tracking is the perfect tool for investigating the impact of immediate audiovisual information on visual attention, but it is less suitable for measuring viewer affect. Psychophysiological measures such as heart rate and skin conductance, neuroimaging methods such as fMRI or EEG, or even self-report ratings may be better for capturing a viewer’s emotional responses to a film, as has been demonstrated by several research teams (Suckfull, 2000; Raz et al, 2014). Unless the emotional state of the viewer changed where they looked or how quickly they moved their eyes, the eye tracker may not detect any differences between two viewers with different emotional states.[iii]

As such, a researcher interested in studying the emotional impact of a film should either choose a different measurement technique or combine eye tracking with another more suitable technique (Dyer and Pink, this issue). This does not mean that eye tracking is unsuitable for studying the cinematic experience. It simply means that you should always choose the right tool for the job, and often this means combining multiple tools that are strong in different ways. As Murray Smith (the current President of the Society for Cognitive Studies of the Moving Image; SCSMI) has argued, a fully rounded investigation of the cinematic experience requires “triangulation” through the combination of multiple perspectives including psychological, neuroscientific and phenomenological/philosophical theory and methods (Smith, 2011) – an approach taken proudly across this special issue.

For the remainder of my commentary I would like to focus on certain themes that struck me as most personally relevant and interesting when reading the other articles in this special issue. This is by no means an exhaustive list of the themes raised by the other articles, or even an assessment of the importance of the particular themes I chose. There are many other interesting observations made in the articles that I do not focus on below, but given my perspective as a cognitive scientist and my current interests I decided to focus my commentary on these specific themes rather than make a comprehensive review of the special issue or tackle topics I am unqualified to comment on. Also, I wanted to take the opportunity to dispel some common misconceptions about eye tracking (see the section ‘Listening to the data’) and empirical methods in general.

Reading an image

One area of film cognition that has received considerable empirical investigation is subtitling. As Kruger, Szarkowska and Krejtz (this issue) so comprehensively review, they and I believe eye tracking is the perfect tool for investigating how we watch subtitled films. The presentation of subtitles divides the film viewing experience into a dual task: reading and watching. Given that the medium was originally designed to communicate critical information through two channels, the image and the soundtrack, introducing text as a third channel of communication places extra demands on the viewer’s visual system. However, for most competent readers serially shifting attention between these two tasks does not lead to difficulties in comprehension (Kruger, Szarkowska and Krejtz, this issue). Immediately following the presentation of the subtitles, gaze will shift to the beginning of the text, saccade across the text and return to the centre of interest within a couple of seconds. Gaze heatmaps comparing the same scenes with and without subtitles (Kruger, Szarkowska and Krejtz, this issue; Fig. 3) show that the areas of the image fixated are very similar (ignoring the area of the screen occupied by the subtitles themselves) and, rather than distracting from the visual content, the presence of subtitles seems to actually condense the gaze behaviour on the areas of central interest in an image, e.g. faces and the centre of the image. This illustrates the redundancy of a lot of the visual information presented in films and the fact that under non-subtitle conditions viewers rarely explore the periphery of the image (Smith, 2013).

My colleague Anna Vilaró and I recently demonstrated this similarity in an eye tracking study in which the gaze behaviour of viewers was compared across versions of an animated film, Disney’s Bolt (Howard & Williams, 2008): the original English audio version, a Spanish language version with English subtitles, an English language version with Spanish subtitles, and a Spanish language version without subtitles (Vilaró & Smith, 2011). Given that our participants were English speakers who did not know Spanish, these conditions allowed us to investigate both where they looked under the different audio and subtitle conditions and what they comprehended. Using cued recall tests of memory for verbal and visual content, we found no significant differences in recall for either type of content across the viewing conditions, except for verbal recall in the Spanish-only condition (not surprisingly, given that our English participants couldn’t understand the Spanish dialogue). Analysis of the gaze behaviour showed clear evidence of subtitle reading, even in the Spanish subtitle condition (see Figure 1), but no differences in the degree to which peripheral objects were explored. This indicates that even when participants are watching film sequences without subtitles and know that their memory will be tested for the visual content, their gaze still remains focussed on central features of a traditionally composed film. This supports arguments for subtitling movies over dubbing: whilst subtitles place greater demands on viewer gaze and heighten cognitive load, there is no evidence that they lead to poorer comprehension.

Figure 1: Figure from Vilaró & Smith (2011) showing the gaze behaviour of multiple viewers directed to own language subtitles (A) and foreign language/uninterpretable subtitles (B).

The high degree of attentional synchrony (Smith and Mital, 2013) observed in the above experiment and during most film sequences indicates that the visual features of the image and the areas of semantic significance (e.g. social information and objects relevant to the narrative) tend to point to the same part of the image (Mital, Smith, Hill and Henderson, 2011). Only when areas of the image are placed in conflict through image composition (e.g. depth of field, lighting, colour or motion contrast) or staging (e.g. multiple actors) does attentional synchrony break down and viewer gaze divide between multiple locations. Such shots are relatively rare in mainstream Hollywood cinema or TV (Salt, 2009; Smith, 2013) and, when used, the depicted action tends to be highly choreographed so attention shifts between the multiple centres of interest in a predictable fashion (Smith, 2012). If such choreographing of action is not used, the viewer can quickly exhaust the information in the image and start craving either new action or a cut to a new shot.

Hochberg and Brooks (1978) referred to this as the visual momentum of the image: the pace at which visual information is acquired. This momentum is directly observable in the saccadic behaviour during an image’s presentation, with frequent short duration fixations at the beginning of a scene’s presentation interspersed with large amplitude saccades (known as the ambient phase of viewing; Velichkovsky, Dornhoefer, Pannasch and Unema, 2000) and less frequent, longer duration fixations separated by smaller amplitude saccades as the presentation duration increases (known as the focal phase of viewing; Velichkovsky et al., 2000). I have recently demonstrated the same pattern of fixations during viewing of dynamic scenes (Smith and Mital, 2013) and shown how this pattern gives rise to more central fixations at shot onset and greater exploration of the image and decreased attentional synchrony as the shot duration increases (Mital, Smith, Hill and Henderson, 2011). Interestingly, the introduction of subtitles to a movie may have the unintended consequence of sustaining visual momentum throughout a shot. The viewer is less likely to exhaust the information in the image because their eyes are busy saccading across the text to acquire the information that would otherwise be presented in parallel to the image via the soundtrack. This increased saccadic activity may increase the cognitive load experienced by viewers of subtitled films and change their affective experience, producing greater arousal and an increased sense of pace.
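The ambient/focal distinction described above can be operationalised as a simple classification over fixation durations and the amplitudes of the saccades that precede them. The sketch below is illustrative only: the threshold values are assumptions chosen for demonstration, not the published parameters.

```python
# Sketch: classifying fixations into "ambient" vs "focal" viewing phases,
# following the pattern described by Velichkovsky et al. (2000).
# The thresholds below are illustrative assumptions, not published values.

def classify_phase(fixation_ms: float, saccade_deg: float,
                   dur_thresh: float = 180.0, amp_thresh: float = 5.0) -> str:
    """Short fixations paired with large saccades suggest ambient scanning;
    long fixations with small saccades suggest focal scrutiny."""
    if fixation_ms < dur_thresh and saccade_deg > amp_thresh:
        return "ambient"
    if fixation_ms >= dur_thresh and saccade_deg <= amp_thresh:
        return "focal"
    return "mixed"

# Early in a shot: brief fixations, large saccades -> "ambient"
print(classify_phase(120, 8.0))
# Later in a shot: long fixations, small saccades -> "focal"
print(classify_phase(350, 2.0))
```

Applied frame-by-frame across a shot, the proportion of ambient-classified fixations would give a rough index of the visual momentum the text describes.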

For some filmmakers and producers of dynamic visual media, increasing the visual momentum of an image sequence may be desirable as it maintains interest and attention on the screen (e.g. Michael Bay’s use of rapidly edited extreme close-ups and intense camera movements in the Transformers movies). In this modern age of multiple screens fighting for our attention when we are consuming moving images (e.g. mobile phones and computer screens in our living rooms and even, sadly, increasingly at the cinema), if the designers of this media are to ensure that our visual attention is focussed on their screen over the other competing screens, they need to design the visual display in a way that makes comprehension impossible without visual attention. Feature films and television dramas often rely heavily on dialogue for narrative communication, and the information communicated through the image may be of secondary narrative importance to the dialogue, so viewers can generally follow the story just by listening to the film rather than watching it. If producers of dynamic visual media are to draw visual attention back to the screen and away from secondary devices, they need to increase the ratio of visual to verbal information. A simple way of accomplishing this is to present the critical audio information through subtitling. The more visually attentive mode of viewing afforded by watching subtitled film and TV may partly explain the growing interest in foreign TV series (at least in the UK), such as the popularity of Nordic Noir series like The Bridge (2011) and The Killing (2007).

Another way of drawing attention back to the screen is to constantly “refresh” the visual content of the image by either increasing the editing rate or creatively using digital composition.[iv] The latter technique is wonderfully exploited by Sherlock (2010) as discussed brilliantly by Dwyer (this issue). Sherlock contemporised the detective techniques of Sherlock Holmes and John Watson by incorporating modern technologies such as the Internet and mobile phones and simultaneously updated the visual narrative techniques used to portray this information by using digital composition to playfully superimpose this information onto the photographic image. In a similar way to how the sudden appearance of traditional subtitles involuntarily captures visual attention and draws our eyes down to the start of the text, the digital inserts used in Sherlock overtly capture our eyes and encourage reading within the viewing of the image.

If Dwyer (this issue) had eye tracked viewers watching these excerpts she would have likely observed this interesting shifting between phases of reading and dynamic scene perception. Given that the appearance of the digital inserts produces sudden visual transients and the inserts are highly incongruous with the visual features of the background scene, they are likely to involuntarily attract attention (Mital, Smith, Hill & Henderson, 2012). As such, they can be creatively used to reinvigorate the pace of viewing and strategically direct visual attention to parts of the image away from the screen centre. Traditionally, the same content may have been presented either verbally as narration, as heavy-handed dialogue exposition (e.g. “Oh my! I have just received a text message stating….”) or as a slow and laboured cut to a close-up of the actual mobile phone so we can read it from the perspective of the character. Neither approach takes full advantage of the communicative potential of the whole screen space or our ability to rapidly attend to and comprehend visual and audio information in parallel.

Such intermixing of text, digital inserts and filmed footage is common in advertisements, music videos, and documentaries (see Figure 2) but is still surprisingly rare in mainstream Western film and TV. Short-form audiovisual messages have recently experienced a massive increase in popularity due to the internet and direct streaming to smartphones and mobile devices. To maximise their communicative potential and increase their likelihood of being “shared”, these videos use all the audiovisual tricks available to them. Text, animations, digital effects, audio and classic filmed footage all mix together on the screen, packing every frame with as much information as possible (Figure 2), essentially maximising the visual momentum of each video and maintaining interest for as long as possible.[v] Such videos are so effective at grabbing attention and delivering satisfying/entertaining/informative experiences in a short period of time that they often compete directly with TV and film for our attention. Once we click play, the audiovisual bombardment ensures that our attention remains latched on to the second screen (i.e., the tablet or smartphone) for its duration and away from the primary screen, i.e., the TV set. Whilst distressing for producers of TV and film who wish our experience of their material to be undistracted, the ease with which we pick up a handheld device and seek other stimulation in parallel to the primary experience may indicate that the primary material does not require our full attention for us to follow what is going on. As attention has a natural ebb-and-flow (Cutting, DeLong and Nothelfer, 2010) and “there is no such thing as voluntary attention sustained for more than a few seconds at a time” (James, 1890, p. 421), if modern producers of film and TV want to maintain a high level of audience attention and ensure it is directed to the screen, they must either rely on viewer self-discipline to inhibit distraction, reward attention to the screen with rich and nuanced visual information (as fans of “slow cinema” would argue of films like those of Béla Tarr) or utilise the full range of postproduction effects to keep visual interest high and maintained on the image, as Sherlock so masterfully demonstrates.

Figure 2: Gaze Heatmaps of participants’ free-viewing a trailer for Lego Indiana Jones computer game (left column) and the Video Republic documentary (right column). Notice how both make copious use of text within the image, as intertitles and as extra sources of information in the image (such as the head-up display in A3). Data and images were taken from the Dynamic Images and Eye Movement project (DIEM; Mital, Smith, Hill & Henderson, 2010). Videos can be found here ( and here (

A number of modern filmmakers are beginning to experiment with the language of visual storytelling by questioning our assumptions of how we perceive moving images. At the forefront of this movement are Ang Lee and Andy and Lana Wachowski. In Ang Lee’s Hulk (2003), Lee worked very closely with editor Tim Squyres to use non-linear digital editing and after effects to break apart the traditional frame and shot boundaries and create an approximation of a comic book style within film. This chaotic, unpredictable style polarised viewers and was partly blamed for the film’s poor reception. However, the experiment was not wholly unsuccessful. Several sequences within the film used multiple frames, split screens, and digital transformation of images to increase the number of centres of interest on the screen and, as a consequence, increase the pace of viewing and the arousal experienced by viewers. In the sequence depicted below (Figure 3), two parallel scenes depicting Hulk’s escape from a containment chamber (A1) and this action being watched from a control room by General Ross (B1) were presented simultaneously by placing elements of both scenes on the screen at the same time. Instead of using a point of view (POV) shot to show Ross looking off screen (known as the glance shot; Branigan, 1984) followed by a cut to what he was looking at (the object shot), both shots were combined into one image (F1 and F2) with the latter shot sliding in from behind Ross’s head (E2). These digital inserts float within the frame, often gliding behind objects or suddenly enlarging to fill the screen (A2-B2). Such visual activity and use of shots-within-shots makes viewer gaze highly active (notice how the gaze heatmap is rarely clustered in one place; Figure 3).
Note that this method of embedding a POV object shot within a glance shot is similar to Sherlock’s method of displaying text messages as both the glance, i.e., Watson looking at his phone, and the object, i.e., the message, are shown in one image. Both uses take full advantage of our ability to rapidly switch from watching action to reading text without having to wait for a cut to give us the information.

Figure 3: Gaze heatmap of eight participants watching a series of shots and digital inserts from Hulk (Ang Lee, 2003). Full heatmap video is available at

Similar techniques have been used in Andy and Lana Wachowski’s films, most audaciously in Speed Racer (2008). Interestingly, both sets of filmmakers seem to intuitively understand that packing an image with as much visual and textual information as possible can lead to viewer fatigue, and so they limit such intense periods to only a few minutes and separate them with more traditionally composed sequences (typically shot/reverse-shot dialogue sequences). These filmmakers have also demonstrated similar respect for viewer attention and the difficulty in actively locating and encoding visual information in a complex visual composition in their more recent 3D movies. Ang Lee’s Life of Pi (2012) uses the visual volume created by stereoscopic presentation to its full potential. Characters inhabit layers within the volume as foreground and background objects fluidly slide around each other within this space. The lessons Lee and his editor Tim Squyres learned on Hulk (2003) clearly informed the decisions they made when tackling their first 3D film and allowed them to avoid some of the issues most 3D films experience, such as eye strain, sudden unexpected shifts in depth and an inability to ensure viewers are attending to the part of the image easiest to fuse across the two eye images (Banks, Read, Allison & Watt, 2012).

Watching Audio

I now turn to another topic featured in this special issue, the influence of audio on gaze (Robinson, Stadler and Rassell, this issue). Film and TV are inherently multimodal. Both media have always existed as a combination of visual and audio information. Even early silent film was almost always presented with either live musical accompaniment or a narrator. As such, the relative lack of empirical investigation into how the combination of audio and visual input influences how we perceive movies and, specifically, how we attend to them, is surprising. Robinson, Stadler and Rassell (this issue) have attempted to address this omission by comparing eye movements for participants either watching the original version of the Omaha beach sequence from Steven Spielberg’s Saving Private Ryan (1998) or the same sequence with the sound removed. This film sequence is a great choice for investigating AV influences on viewer experience as the intensity of the action, the hand-held cinematography and the immersive soundscape all work together to create a disorientating, embodied experience for the viewer. The authors could have approached this question by simply showing a set of participants the sequence with audio and qualitatively describing the gaze behaviour at interesting AV moments during the sequence. Such description of the data would have served as inspiration for further investigation but in itself can’t say anything about the causal contribution of audio to this behaviour, as there would be nothing to compare the behaviour to. Thankfully, the authors avoided this problem by choosing to manipulate the audio.

In order to identify the causal contribution of any factor you need to design an experiment in which that factor (known as the Independent Variable) is either removed or manipulated, and the impact of this manipulation on the behaviour of interest (known as the Dependent Variable) is tested using appropriate inferential statistics. I commend Robinson, Stadler and Rassell’s experimental design as they present such a manipulation and are therefore able to produce data that will allow them to test their hypotheses about the causal impact of audio on viewer gaze behaviour. Several other papers in this special issue (Redmond, Sita and Vincs; Batty, Perkins and Sita) discuss gaze data (typically in the form of scanpaths or heatmaps) from one viewing condition without quantifying its difference from another viewing condition. As such, they are only able to describe the gaze data, not use it to test hypotheses. There is always a temptation to attribute too much meaning to a gaze heatmap (I too am guilty of this; Smith, 2013) due to their seemingly intuitive nature (i.e., they looked here and not there) but, as with all psychological measures, they are only as good as the experimental design within which they are employed.[vi]

Qualitative interpretation of individual fixation locations, scanpaths or group heatmaps is useful for informing initial interpretation of which visual details are most likely to make it into later visual processing (e.g. perception, encoding and long term memory representations) but care has to be taken in falsely assuming that fixation equals awareness (Smith, Lamont and Henderson, 2012). Also, the visual form of gaze heatmaps varies widely depending on how many participants contribute to the heatmap, which parameters you choose to generate the heatmaps and which oculomotor measures the heatmap represents (Holmqvist, et al., 2011). For example, I have demonstrated that, unlike during reading, visual encoding during scene perception requires over 150ms during each fixation (Rayner, Smith, Malcolm and Henderson, 2009). This means that if fixations with durations less than 150ms are included in a heatmap it may suggest parts of the image have been processed which in actual fact were fixated too briefly to be processed adequately. Similarly, heatmaps representing fixation duration instead of just fixation location have been shown to be a better representation of visual processing (Henderson, 2003). Heatmaps have an immediate allure but care has to be taken about imposing too much meaning on them, especially when the gaze and the image are changing over time (see Smith and Mital, 2013; and Sawahata et al, 2008 for further discussion). As eye tracking hardware becomes more available to researchers from across a range of disciplines we need to work harder to ensure that it is not used inappropriately and that the conclusions that are drawn from eye tracking data are theoretically and statistically motivated (see Rayner, 1998; and Holmqvist et al, 2013 for clear guidance on how to conduct sound eye tracking studies).
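To make the parameter choices concrete, here is a minimal sketch of a fixation map that drops sub-150ms fixations and weights cells by fixation duration rather than count, as discussed above. The grid resolution and screen dimensions are assumptions for illustration only.

```python
# Sketch: a duration-weighted fixation map that discards fixations too
# brief (under 150 ms) to support scene encoding (Rayner, Smith, Malcolm
# and Henderson, 2009). Grid and screen sizes are illustrative assumptions.

def fixation_map(fixations, cols=16, rows=9, width=1280, height=720,
                 min_dur_ms=150):
    """fixations: iterable of (x, y, duration_ms) in screen pixels.
    Returns a rows x cols grid of summed fixation durations."""
    grid = [[0.0] * cols for _ in range(rows)]
    for x, y, dur in fixations:
        if dur < min_dur_ms:
            continue  # too brief to assume the content was encoded
        c = min(int(cols * x / width), cols - 1)
        r = min(int(rows * y / height), rows - 1)
        grid[r][c] += dur  # weight by duration, not just fixation count
    return grid

fixs = [(640, 360, 90), (640, 360, 300), (100, 50, 250)]
m = fixation_map(fixs)
print(m[4][8], m[0][1])  # 300.0 250.0 (the 90 ms fixation is excluded)
```

A true heatmap would additionally smooth this grid with a Gaussian kernel; the point here is simply that the inclusion threshold and the weighting measure materially change what the map shows.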

Given that Robinson, Stadler and Rassell (this issue) manipulated the critical factor, i.e., the presence of audio, the question now is whether their study tells us anything new about the AV influences on gaze during film viewing. To examine the influence of audio they chose two traditional methods for expressing the gaze data: area of interest (AOI) analysis and dispersal. By using nine static (relative to the screen) AOIs they were able to quantify how much time the gaze spent in each AOI and utilise this measure to work out how distributed gaze was across all AOIs. Using these measures they reported a trend towards greater dispersal in the mute condition compared to the audio condition and a small number of significant differences in the amount of time spent in some regions across the audio conditions.
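A nine-region static AOI analysis of the kind described reduces to mapping each gaze sample to one of nine screen regions and accumulating time. The sketch below assumes a 3x3 grid, a fixed frame size and a 20ms sampling interval, all illustrative values rather than the authors' actual parameters.

```python
# Sketch: accumulating gaze dwell time in nine static (3x3) screen
# regions, as in a static AOI analysis. Frame size and sample interval
# are illustrative assumptions.

def aoi_index(x, y, width=1280, height=720):
    """Map a gaze coordinate to one of nine static screen regions (0-8),
    numbered left-to-right, top-to-bottom."""
    col = min(int(3 * x / width), 2)
    row = min(int(3 * y / height), 2)
    return row * 3 + col

def dwell_times(samples, sample_ms=20):
    """samples: iterable of (x, y) gaze coordinates at a fixed sample rate.
    Returns milliseconds of gaze accumulated in each of the nine AOIs."""
    times = [0] * 9
    for x, y in samples:
        times[aoi_index(x, y)] += sample_ms
    return times

# Three samples at screen centre (AOI 4), one at top-left (AOI 0)
print(dwell_times([(640, 360), (640, 360), (640, 360), (10, 10)]))
# -> [20, 0, 0, 0, 60, 0, 0, 0, 0]
```

Note that, as discussed below, these regions are fixed to screen coordinates: the same computation says nothing about *what* was fixated when the image content moves.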

However, the conclusions we can draw from these findings are seriously hindered by the low sample size (only four participants were tested, meaning that any statistical test is unlikely to reveal significant differences) and the static AOIs that did not move with the image content. By locking the AOIs to static screen coordinates their AOI measures express the deviation of gaze relative to these coordinates, not to the image content. This approach can be informative for quantifying gaze exploration away from the screen centre (Mital, Smith, Hill and Henderson, 2011) but in order to draw conclusions about what was being fixated the gaze needs to be quantified relative to dynamic AOIs that track objects of interest on the screen (see Smith and Mital, 2013). For example, their question about whether we fixate a speaker’s mouth more in scenes where the clarity of the speech is difficult due to background noise (i.e., their “Indistinct Dialogue” scene) has previously been investigated in studies that have manipulated the presence of audio (Võ, Smith, Mital and Henderson, 2012) or the level of background noise (Buchan, Paré and Munhall, 2007) and measured gaze to dynamic mouth regions. As Robinson, Stadler and Rassell correctly predicted, lip reading increases as speech becomes less distinct or the listener’s linguistic competence in the spoken language decreases (see Võ et al, 2012 for review).
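Computationally, scoring gaze against a dynamic AOI is a per-frame hit test against a moving region. The sketch below assumes per-frame bounding boxes for a tracked mouth region, which in practice would come from manual annotation or an object tracker; all coordinates are invented for illustration.

```python
# Sketch: scoring gaze against a dynamic AOI (e.g. a tracked mouth
# region) rather than static screen coordinates. The per-frame AOI
# boxes here are invented illustrative data.

def in_box(x, y, box):
    """box: (left, top, right, bottom) in screen pixels."""
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def dynamic_aoi_hits(gaze_per_frame, aoi_per_frame):
    """Proportion of frames on which gaze fell inside that frame's AOI."""
    hits = sum(in_box(x, y, box)
               for (x, y), box in zip(gaze_per_frame, aoi_per_frame))
    return hits / len(gaze_per_frame)

gaze = [(640, 400), (650, 410), (300, 200)]
mouth = [(600, 380, 700, 430), (610, 385, 710, 435), (620, 390, 720, 440)]
print(dynamic_aoi_hits(gaze, mouth))  # 2 of 3 frames -> 0.666...
```

Comparing this proportion across audio conditions (e.g. clear versus indistinct dialogue) is what licenses conclusions about lip reading, because the measure follows the mouth as it moves on screen.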

Similarly, by measuring gaze dispersal using a limited number of static AOIs they lose considerable nuance in the gaze data and have to resort to qualitative description of unintuitive bar charts (Figure 4). There exist several methods for quantifying gaze dispersal (see Smith and Mital, 2013, for review) and even open-source tools for calculating this measure and comparing dispersal across groups (Le Meur and Baccino, 2013). Some methods are as easy to calculate as, if not easier than, the static AOIs used in the present study. For example, the Euclidean distance between the screen centre and the x/y gaze coordinates at each frame of the movie provides a rough measure of how spread out the gaze is from the screen centre (typically the default viewing location; Mital et al, 2011), and a similar calculation can be performed between the gaze positions of all participants within a viewing condition to get a measure of group dispersal.
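Both dispersal measures just described reduce to Euclidean distances over gaze coordinates. A minimal sketch (screen centre and gaze coordinates are illustrative assumptions):

```python
import math

# Sketch of the two dispersal measures described above: mean distance of
# gaze from the screen centre, and mean pairwise distance between all
# viewers' gaze positions on a single frame. Coordinates are illustrative.

def centre_bias(gaze, centre=(640, 360)):
    """Mean Euclidean distance of gaze points from the screen centre."""
    return sum(math.dist(g, centre) for g in gaze) / len(gaze)

def group_dispersal(gaze):
    """Mean pairwise distance between all viewers' gaze on one frame."""
    pairs = [(a, b) for i, a in enumerate(gaze) for b in gaze[i + 1:]]
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

frame_gaze = [(640, 360), (660, 360), (620, 360)]
print(centre_bias(frame_gaze))      # (0 + 20 + 20) / 3, approx. 13.33
print(group_dispersal(frame_gaze))  # (20 + 40 + 20) / 3, approx. 26.67
```

Averaged over frames and compared across viewing conditions, these scalar measures retain the frame-by-frame nuance that coarse static AOI counts discard.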

Using such measures, Coutrot and colleagues (2012) showed that gaze dispersal is greater when audio is removed from dialogue film sequences; they also observed shorter amplitude saccades and marginally shorter fixation durations. However, I have recently shown that a non-dialogue sequence from Sergei Eisenstein’s Alexander Nevsky (1938) does not show significant differences in eye movement metrics when the accompanying music is removed (Smith, 2014). This difference in findings points towards interesting differences in the impact that diegetic sound (within the depicted scene, e.g. dialogue) and non-diegetic sound (outside of the depicted scene, e.g. the musical score) may have on gaze guidance. It also highlights how some cinematic features may have a greater impact on aspects of a viewer’s experience that are not measurable by eye tracking, such as physiological markers of arousal and emotional states. This is also the conclusion that Robinson, Stadler and Rassell come to.

Listening to the Data (aka, What is Eye Tracking Good For?)

The methodological concerns I have raised in the previous section lead nicely to the article by William Brown, entitled There’s no I in Eye Tracking: How Useful is Eye Tracking to Film Studies? (this issue). I have known William Brown for several years through our attendance of the Society for Cognitive Studies of the Moving Image (SCSMI) annual conference and I have a deep respect for his philosophical approach to film and his ability to incorporate empirical findings from the cognitive neurosciences, including some references to my own work, into his theories. Therefore, it came as somewhat of a surprise that his article openly attacks the application of eye tracking to film studies. However, I welcome Brown’s criticisms as they provide me with an opportunity to address some general assumptions about the scientific investigation of film and, hopefully, to suggest future directions in which eye tracking research can avoid falling into some of the pitfalls Brown identifies.

Brown’s main criticisms of current eye tracking research are: 1) eye tracking studies neglect “marginal” viewers or marginal ways of watching movies; 2) studies so far have neglected “marginal” films; 3) they only provide “truisms”, i.e., already known facts; and 4) they have an implicit political agenda which holds that the only “true” way to study film is a scientific approach and that the “best” way to make a film is to ensure homogeneity of viewer experience. I will address these criticisms in turn, but before I do so I would like to state that many of Brown’s arguments could be recast as arguments against science itself and are built upon a misunderstanding of how scientific studies should be conducted and what their results mean.

To respond to Brown’s first criticism that eye tracking “has up until now been limited somewhat by its emphasis on statistical significance – or, put simply, by its emphasis on telling us what most viewers look at when they watch films” (Brown, this issue; 1), I first have to subdivide the criticism into ‘the search for significance’ and ‘attentional synchrony’, i.e., how similar gaze is across viewers (Smith and Mital, 2013). Brown tells an anecdote about a Dutch film scholar whose data had to be excluded from an eye tracking study because they did not look where the experimenter wanted them to look. I wholeheartedly agree with Brown that this sounds like a bad study, as data should never be excluded for subjective reasons such as not supporting the hypothesis, i.e., not looking as predicted. However, exclusion for statistical reasons is valid if the research question being tested relates to how representative the behaviour of a small set of participants (known as the sample) is of the overall population. To explain when such a decision is valid, and to respond to Brown’s criticism about only ‘searching for significance’, I will first provide a brief overview of how empirical eye tracking studies are designed and why significance testing is important.

For example, if we were interested in the impact sound has on the probability of fixating an actor’s mouth (e.g., Robinson, Stadler and Rassell, this issue) we would need to compare the gaze behaviour of a sample of participants who watched a sequence with the sound turned on to a sample who watched it with the sound turned off. By comparing the behaviour of these two groups using inferential statistics we test the likelihood that the two viewing conditions would differ across the population of all viewers, given the variation within and between the two samples. In practice we do this by performing the opposite test: testing the probability that the two groups belong to a single, statistically indistinguishable population. This is known as the null hypothesis. If there is less than a 5% chance of observing a difference as large as the one we measured when the null hypothesis is true, we reject the null hypothesis and conclude that the difference is statistically significant, i.e., that another sample of participants presented with the same two viewing conditions would be expected to show a similar difference in viewing behaviour.
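To make the logic of such a two-group comparison concrete, here is a minimal sketch using simulated, entirely hypothetical fixation proportions; the group means, spreads and sample sizes are invented for illustration. The t statistic it computes expresses the difference between group means relative to the variability within the groups.

```python
import math
import random

random.seed(1)
# Hypothetical per-participant proportions of time fixating the mouth AOI:
# twenty viewers with the audio on, twenty with it off
sound_on = [random.gauss(0.35, 0.08) for _ in range(20)]
sound_off = [random.gauss(0.50, 0.08) for _ in range(20)]

def welch_t(a, b):
    """Welch's two-sample t statistic: the difference in group means
    scaled by the standard error of that difference."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

t = welch_t(sound_on, sound_off)
# with samples of this size, |t| greater than roughly 2 corresponds
# to p < .05 (two-tailed), so the null hypothesis would be rejected
print(f"t = {t:.2f}")
```

In a real analysis the t value would be converted to an exact p value against the t distribution (or computed with a statistics package), but the underlying calculation is no more exotic than this.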

In order to test whether our two viewing conditions belong to one or two distributions we need to be able to express these distributions. This is typically done by identifying each participant’s score on the dependent variable of interest, in this case the probability of fixating a dynamic mouth AOI, then calculating the mean of this measure across all participants within a group along with their variation in scores (known as the standard deviation). Most natural measures produce a distribution of scores looking somewhat like a bell curve (known as the normal distribution), with most observations near the centre of the distribution and an ever-decreasing number of observations as you move away from this central score. Each observation (in our case, each participant) can be expressed relative to this distribution by subtracting the mean of the distribution from its score and dividing by the standard deviation. This converts a raw score into a normalised or z-score. Roughly ninety-five percent of all observations fall within two standard deviations of the mean for normally distributed data. This means that observations with a z-score greater than two are highly unrepresentative of that distribution and may be considered outliers.
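The z-score calculation can be sketched in a few lines. The mouth-fixation proportions below are invented for illustration: nine viewers behave similarly and a tenth looks at the mouth far more than the rest.

```python
import statistics

# Hypothetical proportion of viewing time each of ten viewers
# spent fixating the mouth AOI
scores = [0.41, 0.38, 0.45, 0.40, 0.36, 0.43, 0.39, 0.42, 0.37, 0.71]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample standard deviation
z_scores = [(s - mean) / sd for s in scores]

# observations more than two standard deviations from the group mean
# are candidate outliers
outliers = [s for s, z in zip(scores, z_scores) if abs(z) > 2]
print(outliers)  # → [0.71]
```

Here the tenth viewer’s score sits roughly 2.7 standard deviations above the group mean, so it would be flagged, although (as argued below) flagging alone is not sufficient grounds for exclusion.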

However, being unrepresentative of the group mean is insufficient motivation to exclude a participant. The outlier still belongs to the group distribution and should be included unless there is a supporting reason for exclusion, such as measurement error, e.g. poor calibration of the eye tracker. If an extreme outlier is not excluded it can have a disproportionate impact on the group mean and make statistical comparison of groups difficult. However, if this is the case it suggests that the sample size is too small and not representative of the overall population. Correct choice of sample size, given an estimate of the predicted effect size, combined with minimising measurement error should mean that subjective decisions never have to be made about whose data is “right” and who should be included or excluded.

Brown also believes that eye tracking research has so far marginalised viewers who have atypical ways of watching film, such as film scholars, either by not studying them or by treating them as statistical outliers and excluding them from analyses. However, I would argue that the only way to know whether their way of watching a film is atypical is to first map out the distribution of how viewers typically watch films. If a viewer attended more to the screen edge than the majority of other viewers in a random sample of the population (as was the case with Brown’s film scholar colleague), this should show up as a large z-score when their gaze data is expressed relative to the group on a suitable measure, such as Euclidean distance from the screen centre. Similarly, a non-native speaker of English may have appeared as an outlier in terms of how much time they spent looking at the speaker’s mouth in Robinson, Stadler and Rassell’s (this issue) study. Such idiosyncrasies may be of interest to researchers, and there are statistical methods for expressing emergent groupings within the data (e.g. cluster analysis) or for testing whether group membership predicts behaviour (e.g. regression). These approaches may not have previously been applied to questions of film viewing, but this is simply due to the immaturity of the field and the limited availability of the equipment and expertise needed to conduct such studies.
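A simple clustering of a single gaze summary measure can surface such emergent groupings without anyone deciding in advance who counts as “typical”. The sketch below runs a minimal one-dimensional k-means over invented per-viewer mean distances of gaze from the screen centre; all values and names are hypothetical, and a real analysis would use a richer feature set and an established clustering library.

```python
import random

def kmeans_1d(values, k=2, iters=50, seed=0):
    """Minimal 1-D k-means: repeatedly assign each value to its nearest
    cluster centre, then move each centre to the mean of its cluster."""
    random.seed(seed)
    centres = random.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return clusters

# Hypothetical mean gaze distance from the screen centre (pixels) per viewer:
# most hover near the centre, while two "marginal" viewers explore the edges
viewers = [110, 95, 120, 105, 98, 310, 290, 102]
groups = kmeans_1d(viewers, k=2)
print(sorted(groups, key=len))
```

On this toy data the two screen-edge explorers fall out as their own cluster, which a researcher could then examine in its own right rather than discard.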

In my own recent work I have shown how viewing task influences how we watch unedited video clips (Smith and Mital, 2013), how infants watch TV (Wass and Smith, in press), how infant gaze differs to adult gaze (Smith, Dekker, Mital, Saez De Urabain and Karmiloff-Smith, in prep) and even how film scholars attend to and remember a short film compared to non-expert film viewers (Smith and Smith, in prep). Such group viewing differences are of great interest to me and I hope these studies illustrate how eye tracking has a lot to offer to such research questions if the right statistics and experimental designs are employed.

Brown’s second main criticism is that the field of eye tracking neglects “marginal” films. I agree that the majority of films that have so far been used in eye tracking studies could be considered mainstream. For example, the film/TV clips used in this special issue include Sherlock (2010), Up (2009) and Saving Private Ryan (1998). However, this limit is simply a sign of how few eye tracking studies of moving images there have been. All research areas take time to fully explore the range of possible research questions within that area.

I have always employed a range of films from diverse film traditions, cultures, and languages. My first published eye tracking study (Smith and Henderson, 2008) used film clips from Citizen Kane (1941), Dogville (2003), October (1928), Requiem for a Dream (2000), Dancer in the Dark (2000), Koyaanisqatsi (1982) and Blade Runner (1982). Several of these films may be considered “marginal” relative to the mainstream. If I have chosen to focus most of my analyses on mainstream Hollywood cinema this is only because they were the most suitable exemplars of the phenomena I was investigating such as continuity editing and its creation of a universal pattern of viewing (Smith, 2006; 2012). This interest is not because, as Brown argues, I have a hidden political agenda or an implicit belief that this style of filmmaking is the “right” way to make films. I am interested in this style because it is the dominant style and, as a cognitive scientist I wish to use film as a way of understanding how most people process audiovisual dynamic scenes.

Hollywood film stands as a wonderfully rich example of what filmmakers think “fits” human cognition. By testing filmmaker intuitions and seeing what impact particular compositional decisions have on viewer eye movements and behavioural responses I hope to gain greater insight into how audiovisual perception operates in non-mediated situations (Smith, Levin and Cutting, 2012). But, just as a neuropsychologist can learn about typical brain function by studying patients with pathologies such as lesions and strokes, I can also learn about how we perceive a “typical” film by studying how we watch experimental or innovative films. My previous work is testament to this interest (Smith, 2006; 2012a; 2012b; 2014; Smith & Henderson, 2008) and I hope to continue finding intriguing films to study and further my understanding of film cognition.

One practical reason why eye tracking studies rarely use foreign language films is the presence of subtitles. As has been comprehensively demonstrated by other authors in this special issue (Kruger, Szarkowska and Krejtz, this issue) and earlier in this article, the sudden appearance of text on the screen, even if it is incomprehensible, leads to differences in eye movement behaviour. This invalidates the use of eye tracking as a way to measure how the filmmaker intended to shape viewer attention and perception. The alternatives are to use silent film (an approach I employed with October; Smith and Henderson, 2008), to remove the audio (which changes gaze behaviour and awareness of editing; Smith and Martin-Portugues Santacreau, under review) or to use dubbing (which can bias the gaze down to the poorly synched lips; Smith, Batten and Bedford, 2014). None of these options is ideal for investigating foreign language sound film, and until there is a suitable methodological solution this will restrict experimental eye tracking studies to films in a participant’s native language.

Finally, I would like to counter Brown’s assertion that eye tracking investigations of film have so far only generated “truisms”. I admit that there is often a temptation to reduce empirical findings to simplified take-home messages that only seem to confirm previous intuitions such as a bias of gaze towards the screen centre, towards speaking faces, moving objects or subtitles. However, I would argue that such messages fail to appreciate the nuance in the data. Empirical data correctly measured and analysed can provide subtle insights into a phenomenon that subjective introspection could never supply.

For example, film editors believe that an impression of continuous action can be created across a cut by overlapping somewhere between two (Anderson, 1996) and four frames (Dmytryk, 1986) of the action. However, psychological investigations of time perception revealed that our judgements of duration depend on how attention is allocated during the estimated period (Zakay and Block, 1996) and will vary depending on whether our eyes remain still or saccade during the period (Yarrow et al, 2001). In my thesis (Smith, 2006) I used simplified film stimuli to investigate the role that visual attention played in estimation of temporal continuity across a cut and found that participants experienced an overlap of 58.44ms as continuous when an unexpected cut occurred during fixation and an omission of 43.63ms as continuous when they performed a saccade in response to the cut. As different cuts may result in different degrees of overt (i.e., eye movements) and covert attentional shifts these empirical findings both support editor intuitions that temporal continuity varies between cuts (Dmytryk, 1986) whilst also explaining the factors that are important in influencing time perception at a level of precision not possible through introspection.

Reflecting on our own experience of a film suffers from the fact that it relies on our own senses and cognitive abilities to identify, interpret and express what we experience. I may feel that my experience of a dialogue sequence from Antichrist (2010) differs radically from a similar sequence from Secrets & Lies (1996) but I would be unable to attribute these differences to different aspects of the two scenes without quantifying both the cinematic features and my responses to them. Without isolating individual features I cannot know their causal contribution to my experience. Was it the rapid camera movements in Antichrist, the temporally incongruous editing, the emotionally extreme dialogue or the combination of these features that made me feel so unsettled whilst watching the scene? If one is not interested in understanding the causal contributions of each cinematic decision to an audience member’s response then one may be content with informed introspection and not find empirical hypothesis testing the right method. I make no judgement about the validity of either approach as long as each researcher understands the limits of their approach.

Introspection utilises the imprecise measurement tool that is the human brain and is therefore subject to distortion and human bias, and it cannot extrapolate the subjective experience of one person to another. Empirical hypothesis testing also has its limitations: research questions have to be clearly formulated so that hypotheses can be stated in a way that allows them to be statistically tested using appropriate, observable and reliable measurements. A failure at any of these stages can invalidate the conclusions drawn from the data. For example, an eye tracker may be poorly calibrated, resulting in an inaccurate record of where somebody was looking, or it could be used to test an ill-formed hypothesis, such as how a particular film sequence caused attentional synchrony without another film sequence to compare the gaze data to. Each approach has its strengths and weaknesses and no single approach should be considered “better” than any other, just as no film should be considered “better” than any other film.


The articles collected here constitute the first attempt to bring together interdisciplinary perspectives on the application of eye tracking to film studies. I fully commend the intention of this special issue and hope that it encourages future researchers to conduct further studies using these methods to investigate research questions and film experiences we have not even conceived of. However, given that the recent release of low-cost eye tracking peripherals such as the EyeTribe[vii] tracker and the Tobii EyeX[viii] has moved eye tracking from a niche and highly expensive research tool to an accessible option for researchers in a range of disciplines, I need to take this opportunity to issue a word of warning. As I have outlined in this article, eye tracking is like any other research tool: it is only useful if it is used correctly, its limitations are respected, its data are interpreted through the appropriate application of statistics, and conclusions are drawn only from the data in combination with a sound theoretical base. Eye tracking is not the “saviour” of film studies, nor is science the only “valid” way to investigate somebody’s experience of a film. Hopefully, the articles in this special issue and the ideas I have put forward here suggest how eye tracking can function within an interdisciplinary approach to film analysis that furthers our appreciation of film in previously unfathomed ways.



Thanks to Rachael Bedford, Sean Redmond and Craig Batty for comments on earlier drafts of this article. Thank you to John Henderson, Parag Mital and Robin Hill for help in gathering and visualising the eye movement data used in the Figures presented here. Their work was part of the Leverhulme Trust funded DIEM project. The author, Tim Smith, is funded by the EPSRC (EP/K012428/1), the Leverhulme Trust (PLP-2013-028) and a BIAL Foundation grant (224/12).



Anderson, Joseph. 1996. The Reality of Illusion: An Ecological Approach to Cognitive Film Theory. Southern Illinois University Press.

Batty, Craig, Claire Perkins and Jodi Sita. 2015. “How We Came To Eye Tracking Animation: A Cross-Disciplinary Approach to Researching the Moving Image”, Refractory: a Journal of Entertainment Media, 25.

Banks, Martin S., Jenny R. Read, Robert S. Allison and Simon J. Watt. 2012. “Stereoscopy and the human visual system.” SMPTE Mot. Imag. J., 121 (4), 24-43

Bradley, Margaret M., Laura Miccoli, Miguel A. Escrig and Peter J. Lang. 2008. “The pupil as a measure of emotional arousal and autonomic activation.” Psychophysiology, 45(4), 602-607.

Branigan, Edward R. 1984. Point of View in the Cinema: A Theory of Narration and Subjectivity in Classical Film. Berlin: Mouton.

Brown, William. 2015. “There’s no I in Eye Tracking: How Useful is Eye Tracking to Film Studies?”, Refractory: a Journal of Entertainment Media, 25.

Buchan, Julie N., Martin Paré and Kevin G. Munhall. 2007. “Spatial statistics of gaze fixations during dynamic face processing.” Social Neuroscience, 2, 1–13.

Coutrot, Antoine, Nathalie Guyader, Gelu Ionesc and Alice Caplier. 2012. “Influence of Soundtrack on Eye Movements During Video Exploration”, Journal of Eye Movement Research 5, no. 4.2: 1-10.

Cutting, James. E., Jordan E. DeLong and Christine E. Nothelfer. 2010. “Attention and the evolution of Hollywood film.” Psychological Science, 21, 440-447.

Dwyer, Tessa. 2015. “From Subtitles to SMS: Eye Tracking, Texting and Sherlock”, Refractory: a Journal of Entertainment Media, 25.

Dyer, Adrian. G and Sarah Pink. 2015. “Movement, attention and movies: the possibilities and limitations of eye tracking?”, Refractory: a Journal of Entertainment Media, 25.

Dmytryk, Edward. 1986. On Filmmaking. London, UK: Focal Press.

Henderson, John. M., 2003. “Human gaze control during real-world scene perception.” Trends in Cognitive Sciences, 7, 498-504.

Hochberg, Julian and Virginia Brooks. 1978. “Film Cutting and Visual Momentum”. In John W. Senders, Dennis F. Fisher and Richard A. Monty (Eds.), Eye Movements and the Higher Psychological Functions (pp. 293-317). Hillsdale, NJ: Lawrence Erlbaum.

Holmqvist, Kenneth, Marcus Nyström, Richard Andersson, Richard Dewhurst, Halszka Jarodzka and Joost van de Weijer. 2011. Eye Tracking: A comprehensive guide to methods and measures. Oxford, UK: OUP Press.

James, William. 1890. The Principles of Psychology (Vol. 1). New York: Holt.

Kruger, Jan Louis, Agnieszka Szarkowska and Izabela Krejtz. 2015. “Subtitles on the Moving Image: An Overview of Eye Tracking Studies”, Refractory: a Journal of Entertainment Media, 25.

Le Meur, Olivier and Baccino, Thierry. 2013. “Methods for comparing scanpaths and saliency maps: strengths and weaknesses.” Behavior research methods, 45(1), 251-266.

Magliano, Joseph P. and Jeffrey M. Zacks. 2011. “The Impact of Continuity Editing in Narrative Film on Event Segmentation.” Cognitive Science, 35(8), 1-29.

Mital, Parag K., Tim J. Smith, Robin Hill. and John M. Henderson. 2011. “Clustering of gaze during dynamic scene viewing is predicted by motion.” Cognitive Computation, 3(1), 5-24

Rayner, Keith. 1998. “Eye movements in reading and information processing: 20 years of research”. Psychological Bulletin, 124(3), 372-422.

Rayner, Keith, Tim J. Smith, George Malcolm and John M. Henderson. 2009. “Eye movements and visual encoding during scene perception.” Psychological Science, 20, 6-10.

Raz, Gal, Yael Jacob, Tal Gonen, Yonatan Winetraub, Tamar Flash, Eyal Soreq and Talma Hendler. 2014. “Cry for her or cry with her: context-dependent dissociation of two modes of cinematic empathy reflected in network cohesion dynamics.” Social cognitive and affective neuroscience, 9(1), 30-38.

Redmond, Sean, Jodi Sita and Kim Vincs. 2015. “Our Sherlockian Eyes: the Surveillance of Vision”, Refractory: a Journal of Entertainment Media, 25.

Robinson, Jennifer, Jane Stadler and Andrea Rassell. 2015. “Sound and Sight: An Exploratory Look at Saving Private Ryan through the Eye-tracking Lens”, Refractory: a Journal of Entertainment Media, 25.

Salt, Barry. 2009. Film Style and Technology: History and Analysis (Vol. 3rd). Totton, Hampshire, UK: Starword.

Sawahata, Yasuhito, Rajiv Khosla, Kazuteru Komine, Nobuyuki Hiruma, Takayuki Itou, Seiji Watanabe, Yuji Suzuki, Yumiko Hara and Nobuo Issiki. 2008. “Determining comprehension and quality of TV programs using eye-gaze tracking.” Pattern Recognition, 41(5), 1610-1626.

Smith, Murray. 2011. “Triangulating Aesthetic Experience”, paper presented at the annual Society for Cognitive Studies of the Moving Image conference, Budapest, June 8–11, 2011.

Smith, Tim J. 2006. An Attentional Theory of Continuity Editing. Ph.D., University of Edinburgh, Edinburgh, UK.

Smith, Tim J. 2012a. “The Attentional Theory of Cinematic Continuity”, Projections: The Journal for Movies and the Mind. 6(1), 1-27.

Smith, Tim J. 2012b. “Extending AToCC: a reply,” Projections: The Journal for Movies and the Mind. 6(1), 71-78

Smith, Tim J. 2013. “Watching you watch movies: Using eye tracking to inform cognitive film theory.” In A. P. Shimamura (Ed.), Psychocinematics: Exploring Cognition at the Movies. New York: Oxford University Press. pages 165-191

Smith, Tim J. 2014. “Audiovisual correspondences in Sergei Eisenstein’s Alexander Nevsky: a case study in viewer attention”. Cognitive Media Theory (AFI Film Reader), Eds. P. Taberham & T. Nannicelli.

Smith, Tim J., Jonathan Batten and Rachael Bedford. 2014. “Implicit detection of asynchronous audiovisual speech by eye movements.” Journal of Vision, 14(10), 440-440.

Smith, Tim J., Tessa Dekker, Parag K. Mital, Irati R. Saez De Urabain and Annette Karmiloff-Smith. In Prep. “Watch like mother: Motion and faces make infant gaze indistinguishable from adult gaze during Tot TV.”

Smith, Tim J. and John M. Henderson. 2008. “Edit Blindness: The relationship between attention and global change blindness in dynamic scenes”. Journal of Eye Movement Research, 2(2):6, 1-17.

Smith, Tim J., Peter Lamont and John M. Henderson. 2012. “The penny drops: Change blindness at fixation.” Perception, 41(4), 489-492.

Smith, Tim J., Daniel Levin and James E. Cutting. 2012. “A Window on Reality: Perceiving Edited Moving Images.” Current Directions in Psychological Science. 21: 101-106

Smith, Tim J. and Parag K. Mital. 2013. “Attentional synchrony and the influence of viewing task on gaze behaviour in static and dynamic scenes”. Journal of Vision 13(8): 16.

Smith, Tim J. and Janet Y. Martin-Portugues Santacreu. Under Review. “Match-Action: The role of motion and audio in limiting awareness of global change blindness in film.”

Smith, Tim. J. and Murray Smith. In Prep. “The impact of expertise on eye movements during film viewing.”

Suckfull, Monika. 2000. “Film Analysis and Psychophysiology: Effects of Moments of Impact and Protagonists”. Media Psychology, 2(3), 269-301.

Vilaro, Anna and Tim J. Smith. 2011. “Subtitle reading effects on visual and verbal information processing in films.” Published abstract in Perception. ECVP abstract supplement, 40 (p. 153). European Conference on Visual Perception. Toulouse, France.

Velichkovsky, Boris M., Sascha M. Dornhoefer, Sebastian Pannasch and Pieter J. A. Unema. 2001. “Visual fixations and level of attentional processing”. In Andrew T. Duhowski (Ed.), Proceedings of the International Conference Eye Tracking Research & Applications, Palm Beach Gardens, FL, November 6-8, ACM Press.

Wass, Sam V. and Tim J. Smith. In Press. “Visual motherese? Signal-to-noise ratios in toddler-directed television,” Developmental Science

Yarrow, Kielan, Patrick Haggard, Ron Heal, Peter Brown and John C. Rothwell. 2001. “Illusory perceptions of space and time preserve cross-saccadic perceptual continuity”. Nature, 414.

Zakay, Dan and Richard A. Block. 1996. Role of Attention in Time Estimation Processes. Time, Internal Clocks, and Movement. Elsevier Science.



[ii] An alternative take on eye tracking data is to divorce the data itself from psychological interpretation. Instead of viewing a gaze point as an index of where a viewer’s overt attention is focussed and a record of the visual input most likely to be encoded into the viewer’s long-term experience of the media, researchers can instead take a qualitative, or even aesthetic approach to the data. The gaze point becomes a trace of some aspect of the viewer’s engagement with the film. The patterns of gaze, its movements across the screen and the coordination/disagreement between viewers can inform qualitative interpretation without recourse to visual cognition. Such an approach is evident in several of the articles in this special issue (including Redmond, Sita, and Vincs, this issue; Batty, Perkins, and Sita, this issue). This approach can be interesting and important for stimulating hypotheses about how such patterns of viewing have come about and may be a satisfying endpoint for some disciplinary approaches to film. However, if researchers are interested in testing these hypotheses further empirical manipulation of the factors that are believed to be important and statistical testing would be required. During such investigation current theories about what eye movements are and how they relate to cognition must also be respected.

[iii] Although, one promising area of research is the use of pupil diameter changes as an index of arousal (Bradley, Miccoli, Escrig and Lang, 2008).

[iv] This technique has been used for decades by producers of TV advertisements and by some “pop” serials such as Hollyoaks in the UK (Thanks to Craig Batty for this observation).

[v] This trend in increasing pace and visual complexity of film is confirmed by statistical analyses of film corpora over time (Cutting, DeLong and Nothelfer, 2010) and has resulted in a backlash and increasing interest in “slow cinema”.

[vi] Other authors in this special issue may argue that taking a critical approach to gaze heatmaps without recourse to psychology allows them to embed eye tracking within their existing theoretical framework (such as hermeneutics). However, I would warn that eye tracking data is simply a record of how a relatively arbitrary piece of machinery (the eye tracking hardware) and associated software decided to represent the centre of a viewer’s gaze. There are numerous parameters that can be tweaked to massively alter how such gaze traces and heatmaps appear. Without understanding the psychology and the physiology of the human eye, a researcher cannot know how to set these parameters or how much to trust the equipment they are using and the data it is recording, and as a consequence may over-attribute interpretation to a representation that is not reliable.

[vii] (accessed 13/12/14). The EyeTribe tracker is $99 and is as spatially and temporally accurate (up to 60Hz sampling rate) as some science-grade trackers.

[viii] (accessed 13/12/14). The Tobii EyeX tracker is $139, samples at 30Hz and is as spatially accurate as the EyeTribe although the EyeX does not give you as much access to the raw gaze data (e.g., pupil size and binocular gaze coordinates) as the EyeTribe.



Dr Tim J. Smith is a senior lecturer in the Department of Psychological Sciences at Birkbeck, University of London. He applies empirical Cognitive Psychology methods including eye tracking to questions of Film Cognition and has published extensively on the subject both in Psychology and Film journals.


Politicizing Eye tracking Studies of Film – William Brown


This essay puts eye tracking studies of cinema into contact with film theory, or what I term film-philosophy, so as to distinguish film theory from specifically cognitive film theory. Looking at the concept of attention, the essay explains how winning and keeping viewers’ attention in a synchronous fashion is understood by eye tracking studies of cinema as key to success in filmmaking, while film-philosophy considers the winning and keeping of attention by cinema to be a political issue driven by economics and underscored by issues of control. As such, film-philosophy understands cinema as political, even if eye tracking studies of film tend to avoid engagement in political debate. Nonetheless, the essay identifies political dimensions in eye tracking film studies: the legitimization of the approach, its emphasis on mainstream cinema as an object of study and its emphasis on statistical significance all potentially have political connotations/ramifications. Invoking the concept of cinephilia, the essay then suggests that idiosyncratic viewer responses, as well as films that do not synchronously capture attention, might yield important results/play an important role in life in an attention-driven society.

In this essay, I wish to put eye tracking studies of film into dialogue with a more political approach to film, drawn from film theory, or what, for the benefit of distinguishing film theory from cognitive film theory, I shall term film-philosophy. In doing so, I shall draw out what for film-philosophy are some of the limitations of eye tracking, including its emphasis on statistical significance, or what most viewers look at when they watch films. I shall argue that we might learn as much, if not more, about cinema by paying attention not only to statistically significant and shared responses to films (what most viewers look at), but also to those viewers whose responses to a film do not form part of the statistically significant group, and/or to films that may not induce in viewers statistically significant and shared responses. In effect, we may find that there are insights to be derived from those who look at the margins of the cinematic image, rather than at the centre, even if those viewers are themselves ‘marginal’ in the sense that they are pushed to the margins of most/all eye tracking studies of film viewers. There is perhaps also value to be found in looking at ‘marginal’ films. In this way, we might find that idiosyncratic responses to a film or films are as important as the shared response. I shall also argue that there is a politics to the idiosyncratic response, especially when it is put into dialogue with film theoretical/film-philosophical work on cinephilia, and that as a result there is also a politics to eye tracking and its emphasis on statistical significance. I shall start, however, by looking at the state of eye tracking film research today.

On 29 and 30 July 2014, the Academy of Motion Picture Arts and Sciences (AMPAS) – the same American academy that distributes the so-called Oscars – held two events under the combined title of ‘Movies in Your Brain: The Science of Cinematic Perception’. The events included contributions from neuroscientists Uri Hasson, Talma Hendler and Jeffrey M. Zacks, psychologist James E. Cutting, directors Darren Aronofsky and Jon Favreau, editor Walter Murch and writer-producer Ari Handel. The host of the first evening was psychologist Tim J. Smith, whose eye tracking studies of cinema have arguably become the best known and most influential over recent years (see, inter alia, Smith 2012a; Smith 2013; Smith 2014). Through these events, as well as through coverage of these events in fashionable magazines like Wired (Miller 2014a; Miller 2014b), we can see how eye tracking – together with the study of film using brain scanning technologies such as functional Magnetic Resonance Imaging (fMRI) – is clearly becoming important for our understanding of how films work. This in turn means that such studies are surely important to film studies.

For a detailed history and overview of eye tracking, explaining how it works and what it tells us about film, I cannot do better than to guide readers to the afore-mentioned work by Smith. Smith has soundly demonstrated, and with great clarity, how the human eye moves via rapid jumps called saccades, and that in between saccades the human eye fixates. It is during fixations that humans take in visual information, with fixations being linked therefore to attention and to working memory; we tend to remember objects from our visual field upon which we have fixated, or to which we have paid attention. Clearly this is important to the study of film, since viewers typically attend only to parts of the movie screen at any given time, and not necessarily to others or to the whole of the screen (and the surrounding auditorium). Can/do filmmakers exert influence over where we look, for how long, and thus what we remember about a film – with those memories themselves lasting for greater or lesser periods of time? And if filmmakers do influence such things, how much influence do they exert and through which techniques? These are the questions that eye tracking technology can help to answer – and scholars like Smith do so with great skill and eloquence.

My aim, however, is not simply to reproduce findings by Smith and others who have used eye tracking devices to study film. In order to construct a theoretical argument concerning the importance of the idiosyncratic, or ‘cinephilic’, response to a film or films in general, as well as the importance of a filmmaker not necessarily ‘controlling’ where a viewer looks, but instead allowing/encouraging viewers precisely to look idiosyncratically, cinephilically, or where they wish, I need instead to bring the scientific and ‘apolitical’ use of eye tracking devices into a political discourse concerning the nature of cinema, power, hegemony and the issue of cinematic homogeneity and/or heterogeneity. This is a controversial maneuver – in that it will bring together two areas of film studies that often seem to stand in ‘opposition’ to each other, namely cognitive film theory and a film theory that still plies its trade using Continental philosophy, or what for the sake of simplicity I shall term film-philosophy. My desire is not simply to be controversial, however. Rather it is to engage with what eye tracking means to film studies, both currently and potentially in the future.

To begin to bring eye tracking studies of film into the ‘political discourse’ mentioned above, I shall relate an anecdote. A semi-regular response from colleagues in film studies, when I tell them about eye tracking studies of film viewing, is that eye tracking doesn’t tell us anything about films that we didn’t already know. Is it a surprise that we tend to look more often at the center of the screen? Is it a surprise that we typically attend more to brightly illuminated parts of the screen than to dimly lit ones? Is it a surprise that we tend to direct our attention towards human faces when watching a film that features human characters? Anyone who has consciously thought about what they do while watching a film will be able to tell from memory alone that these things are all true. As a result, eye tracking studies of film can sometimes be filled with what, at least to the film student/scholar, are truisms. By way of an example, Paul Marchant and colleagues say that ‘these strategies and techniques… [capture] the audience’s visual attention: focus, camera movement, eye line match, color and contrast, motion of elements within the shot, graphic matching’ (Marchant et al. 2009, 158). On my print-out of Marchant et al.’s essay, my own apostil next to this assertion reads as follows: ‘Do we not know this already (otherwise cinema would not have developed these techniques)?’ Many, if not all, film viewers will know simply from experience that these techniques help to guide their attention, even if they are blissfully unaware of the relationship between eye fixations, attention and memory. Of course, it is pleasing to have our introspective responses to/our intuitive knowledge about cinema ‘scientifically’ confirmed (to a large extent, but not entirely – about which, more later); but essentially, so my colleagues’ argument goes, eye tracking studies tell us what we already know.

Now, even if I myself find some eye tracking studies of film to be ‘truistic’, I nonetheless believe that eye tracking studies of film are of great importance. However, their importance is perhaps in playing a role that is different from the one that eye tracking studies of film seem to give to themselves, which is as a key component of cognitive film theory. Instead, I think that eye tracking studies of film are important for film theory, or what today is termed film-philosophy. I shall explain the distinction between cognitive film theory and film-philosophy presently.

Little in this world is uniform, and so by definition I generalize when I say that the basic tenet of cognitive film theory – with David Bordwell and Noël Carroll’s Post Theory: Reconstructing Film Studies (1996) serving as its figurehead – is for film studies to move towards a theory of cinema based on the analysis of films themselves, and away from a film theory that uses cinema as a means of confirming or denying a Lacanian understanding of the human and/or an Althusserian/Marxist conception of contemporary capital. In spite of cognitive film theory’s lack of uniformity, eye tracking studies of film are nonetheless part of cognitive film theory’s project to help us to look at cinema ‘as it is’, and not to use cinema as a political football. Conversely, film-philosophy is in general informed by the kinds of Continental philosophers, often though not limited to Gilles Deleuze, whom cognitive film theorists reject, and it engages not just with films ‘as they are’, but with the politics of films.

Now, to claim that we can isolate films and film viewing from a human world that is perhaps always political, and to claim that we can then analyse films ‘as they are’, is perhaps absurd: films ‘as they are’ are part of a political world, and cognitive film theorists are not unaware of this, just as film-philosophers are not incapable of scientific analysis. However, how much politics is allowed into the analysis of films perhaps informs the broad distinction between cognitive film theory and film-philosophy, as I hope to clarify by looking briefly at the role of attention in the work of two scholars, Tim J. Smith and Jonathan Beller. In his ‘Attentional Theory of Cinematic Continuity’ (AToCC), Smith (2012a) uses eye tracking studies to demonstrate how filmmakers capture and maintain viewers’ attention, with certain techniques, mainly those associated with continuity editing, being more successful than others. Meanwhile, in his Cinematic Mode of Production: Attention Economy and the Society of the Spectacle, Beller (2006) suggests that capturing attention is not necessarily an aesthetic, but rather a political project: the more attention a film garners, the more success one will have in monetizing that film, with the making of money becoming the bottom line of cinema. Beller does not appeal to some early cinema that did not attempt to elicit viewers’ attention and thus make money; such an early cinema did not necessarily exist. Rather, Beller argues that cinema has always been part of an economy that is based on attention; indeed, cinema plays a key role in naturalizing this attention economy, meaning that cinema has not always been necessarily capitalist, but that the capitalist world endeavors as much as possible to become cinematic, to capture our attention as much as possible in order to ‘win’ the economic race, since capturing eyeballs means making money. Smith explains how attention is captured; Beller offers an explanation as to why. 
Even though filmmakers rely on natural processes in order to capture attention (Smith), the process of consistently trying to capture our attention (‘cinema’) is not natural, but political and economic (Beller).

James E. Cutting, in commenting on an earlier draft of this paper, says that the results of eye tracking studies of film, which reveal how filmmakers capture attention, are

big news… because almost nothing else does this – not static pictures (photographs, artworks), not classroom behavior by teachers, not leaders of business meetings, and often not even spectacles of various kinds (sporting events, rock concerts, etc.); even TV is typically not as good as the average narrative, popular movie. (Cutting, signed peer review 2014)

If cinema is indeed better at capturing our attention than these other media, and if in some senses it is better at capturing our attention than those parts of the world that do not feature such media – i.e. if cinema is better at capturing our attention than reality – then cinema, and the making-cinematic of reality in a bid to capture attention, to make money and/or to influence people (Cutting compares cinema in particular to teachers and to business leaders), is profoundly political. It is profoundly political because learning about how to capture attention – learning about how cinema works – is tied to the shaping of our material reality (putting screens everywhere) and to controlling attention (encouraging us to look at those screens, and not at the rest of reality). Cognitive film theory presents itself as apolitical; film-philosophy, meanwhile, engages with the very political dimensions of cinema. Eye tracking studies of film tend to position themselves as part of the former; my aim here is to bring them into dialogue with the latter.

If eye tracking studies of film tend to position themselves as part of a would-be apolitical approach to cinema, then in their investigation into cinema, they are nonetheless conducting an investigation into politics, as per Beller’s equation of cinema with politics highlighted above. However, while eye tracking studies of film position themselves as apolitical, politics do creep into eye tracking studies, especially through what I shall call their absences. What is more, these politics do relate to film-philosophy’s ‘political’ approach to film. In order to demonstrate this, I shall begin by analyzing how eye tracking studies of film have sought historically to legitimate themselves.

Early in an essay that gives an overview of eye tracking studies of film, Smith asserts, without naming any, that the hypotheses of film theory ‘generally remain untested’ (Smith 2013, 165). In this almost throwaway comment, we perhaps find important information. For in asserting that eye tracking is what can help us to ‘test’ out some theories of film, as Smith goes on to do in relation to Sergei M. Eisenstein’s writing about his own film, Alexander Nevsky (USSR, 1938), he perhaps overlooks how film theorists often (but perhaps not always) try (though not always with success) to construct their theories based on the films that they have seen, studied and perhaps made, and not the other way around. That is, Smith seems not to consider that watching films is itself a means of testing our theories about films – without the need for eye tracking devices. On a related note, while he does consider filmmakers like Eisenstein, D.W. Griffith, Edward Dmytryk and others as ‘experimentalists’ of sorts (who have tested their own theories), Smith also does not fully acknowledge that the history of cinema can itself be seen as a prolonged ‘test’ of what ‘works’ or ‘does not work’ with audiences – with that which ‘works’ being regularly adopted as either a short- or a long-term strategy by the film industry, be that in terms of re-using storylines, adopting a specific cinematic style, employing bankable film stars, using topical settings, engaging with zeitgeist themes and so on. Instead, it is Smith’s intervention that will validate (or otherwise) that history of theory and practice, and that will confirm what filmmakers, and perhaps also many audience members, have probably known for a long time – even if putting their knowledge into practice sometimes proves harder than we might imagine (because otherwise films would presumably not have ‘mistakes’ in them).

Now, it’s natural that a (relatively) new approach to studying film would need to legitimize itself in order to gain credibility and a following – and Smith clearly charts the c.30-year trajectory of eye tracking in film studies from the 1980s onwards (Smith 2014, 90). Nonetheless, if the history of cinema is not ‘test’ enough for Smith, then implicitly a claim is being made here about what constitutes a ‘real’ test, and, by extension, about what sort of person can carry out a ‘real’ test. In other words, eye tracking, and the cognitive framework more generally, here legitimizes itself as a tool for verifying (scientifically) what previously were ‘mere’ and speculative theories (these are my terms) – with the people qualified to carry out these tests being neither filmmakers nor audience members, but psychologists. By justifying eye tracking in this way, Smith is not just making a statement of fact (eye tracking demonstrates that viewers look at the same things at the same time during films made using the continuity editing style); he is also – I assume unintentionally – making an implicit value judgment that carries political assumptions regarding what constitutes a/the most legitimate framework for learning and knowing about film. If, as per my anecdote above, I can and do know via introspection the same things that eye tracking tells me, then why is introspection not an equally legitimate framework, even if it involves less visible labor, and certainly less sexy imagery, and thus does not seem to involve any real ‘testing’?

Eye tracking thus seeks ‘politically’ to legitimate itself as a tool for film analysis. To be clear: eye tracking is legitimate, but it is also always already making claims about what constitutes knowledge: introspection is not knowledge, while science is – even if both can lead to the same understanding. Importantly, in producing visible evidence (the afore-mentioned ‘sexy imagery’ of colored clusters of eye-gaze on scenes from films), then eye tracking studies are also always already cinematic, by which I mean to say that they affirm a system whereby the visual/the cinematic (here are pictures of attention being captured) are validated above invisible (here, introspective) approaches to the same knowledge. This in turn always already affirms the process of cinema and attention-grabbing as being the (political) system that is most powerful.

If eye tracking affirms a politically cinematic world, in that cinematic forms of knowledge are more valid than invisible, i.e. uncinematic, ones, then within that cinematic world eye tracking might also, and in some respects implicitly does, legitimate some forms of cinema over others. This is suggested by the way in which eye tracking studies look predominantly at Hollywood/mainstream cinema in their analyses of film. For example, in his AToCC, Smith (2012a) cites a diverse range of movies, including L’escamotage d’une dame au théâtre Robert Houdin/The Vanishing Lady (Georges Méliès, France, 1896) and L’année dernière à Marienbad/Last Year at Marienbad (Alain Resnais, France/Italy, 1961), but eye tracking data are given mainly for contemporary Hollywood films, including Blade Runner (Ridley Scott, USA/Hong Kong/UK, 1982), Requiem for a Dream (Darren Aronofsky, USA, 2000) and There Will Be Blood (Paul Thomas Anderson, USA, 2007), with Smith suggesting that continuity editing is the form of cinema best suited to capturing attention.1

The absence of eye tracking data on those other, non-Hollywood films is perhaps telling, as suggested by two respondents to Smith’s essay, who query how his theories would apply to different cinemas, including the avant garde (Freeland 2012, 40-41) and, at least by implication, Japanese cinema (Rogers 2012, 47-48). Eye tracking would of course yield important insights into avant-garde and other forms of cinema, but that information is not offered here.

Furthermore, Smith’s suggestion that continuity editing is the form best suited to capturing attention also prompts Paul Messaris and Greg M. Smith to argue that continuity editing violations, in particular jump cuts, are quite regular and not particularly detrimental to the continuity of the film viewing experience (Messaris 2012, 28-29; Greg M. Smith 2012, 57). Malcolm Turvey, meanwhile, argues that the film viewing experience is always continuous, meaning that the ‘continuity’ of continuity editing ‘is not continuity of viewer attention per se… but rather the manner in which films engage and manage that attention’ (Turvey 2012, 52-53; for Smith’s riposte to these responses and more, see Smith 2012b).

These responses highlight how filmmaking ‘perfection’ (an absence of continuity errors) need not be fetishized too much; audiences are quite happy to watch films with continuity errors (many of which they will not notice). Furthermore, many audiences love what Jeffrey Sconce (1995) might term ‘paracinema’ – i.e. ‘trash’ cinema and ‘bad’ movies – be they intentionally ‘bad’ or otherwise. In other words, it would seem that as long as audiences are primed regarding how they should receive a film (or, in Turvey’s language, as long as their attention is managed and then engaged in the right way), then they need not care about, and can even love, the stylised acting, the ropey mise-en-scène, the unmotivated camera movements, the strange edits and the story loopholes of, say, The Room (Tommy Wiseau, USA, 2003), supposedly the worst film in history. Under the right circumstances (with the right management/preparation), it would seem that audiences can like pretty much anything, including a 485-minute film of the Empire State Building (Empire, Andy Warhol, USA, 1964). In other words, while in his AToCC Smith mentions Méliès and Resnais, and while he engages with Eisenstein and other filmmakers elsewhere, the AToCC puts an emphasis on mainstream Hollywood cinema and its predominant system of continuity editing, since this cinema elicits a synchronicity of response, or control over attention, in that viewers attend to the same parts of the screen at the same time – while also often failing to detect edits done in the continuity editing style (see Smith and Henderson 2008). There is a seeming bias here towards mainstream, narrative filmmaking, the engrossing nature of which is lauded at the expense of other cinemas.

Let us move away from Smith in order to demonstrate how this bias is not his alone. Jennifer Treuting suggests that ‘[t]he use of eye tracking… can help filmmakers and other visual artists refine their craft’ (Treuting 2006, 31). In some respects, this is an innocent comment; I have no doubt that eye tracking can help filmmakers and other visual artists to refine their craft. But suggested in this ‘refinement’ is also the move towards validating the mainstream/continuity style at the expense of its alternatives. A combined eye tracking and fMRI study carried out by Uri Hasson and colleagues also makes this clear: much fuss is made over how work by Alfred Hitchcock elicits greater synchrony (‘inter-subject correlation’) in viewers than does an ‘unstructured’ shot of a concert in Washington Square Park, a film that is simply a ‘point of reference’ and which ‘fails to direct viewers’ gaze’ (Hasson et al. 2008, 13-14; emphasis added). My reference above to Warhol’s Empire here becomes apposite: what Hasson and colleagues dismiss as a mere ‘point of reference’ and as a ‘failure’ is, in various respects, precisely what defines one of the great experimental films. Perhaps ‘marginal’ films like Empire should also be considered successful – but at achieving something different to the work of Hitchcock, and perhaps Hasson’s film is not a ‘point of reference’, but an experimental work that equally inhabits the totality of films in the world that we shall call cinema.

If Hitchcock ‘succeeds’ in controlling viewers’ attention, while Warhol by implication ‘fails’, then eye tracking becomes implicitly/inevitably embroiled not just in what film is, but in what film could or should be – as Treuting’s suggestion that eye tracking might feed back into filmmaking also makes clear. This suggests that there is a politics to eye tracking film studies, particularly in the UK, where universities increasingly rely on demonstrating ‘impact’, particularly economic impact, in order to survive: such studies don’t just observe films, but feed back into how films are, or should be, made, by exploring what is ‘successful’ in terms of eliciting attention, getting bums on seats and thus making money. In some respects, eye tracking in particular and cognitive film theory in general are now dragged back towards the Marxist approach to cinema that cognitive film theory initially sought to reject: it, too, shapes/seeks to shape cinema, just as Marxist film theory in effect lobbied for alternatives to the mainstream. However, where Marxist film theory lobbied for a rejection of mainstream cinematic techniques, eye tracking studies seem to validate them – and to suggest that filmmakers might ‘refine their craft’ by adopting/intensifying them. Saving the thorny issue of ‘control’ and ‘influence’ for later, there is still a political dimension to this potential validation of mainstream cinema techniques, because it reaffirms the economic hegemony of one style over others, and it also validates to some degree a homogeneity of product (and of audience?) – all within a ‘cinematic’ economic system that is itself predicated upon gaining attention. Cinema is both business and art, but if art is one thing, it is unique/different, and so a move towards homogeneity is a move towards the reduction of art in favor of business.
If it requires an artist rather than an academic to make this clear, then Darren Aronofsky’s apprehensive response to Hasson’s work at the AMPAS events hopefully serves this purpose: ‘“It’s a scary tool for the studios to have,” Aronofsky said. “Soon they’ll do test screenings with people in MRIs.” The audience laughed, but it didn’t seem like he was joking, at least not entirely’ (Miller 2014b).

I have so far argued that cinema is political, that eye tracking studies have required some political maneuvering in order to legitimate themselves, and that the focus on continuity editing/mainstream cinema by eye tracking studies may also have a political dimension. However, are eye tracking studies themselves without methodological politics, in that they simply report findings? I wish presently to suggest that eye tracking research does have methodological limitations – which is why I asserted above that eye tracking film studies are only to a large extent, and not entirely, reliable – and that these limitations also have a political dimension. The methodological limitations are not simply a case of potential inaccuracies regarding the type of eye-tracker used, determining how long the eye needs to be still for a fixation to take place, what algorithm is used to measure this, or how accurate the eye-tracker is in determining where exactly the eye is looking – all ongoing issues with eye tracking technologies (see, inter alia, Wass et al. 2013; Saez de Urabain et al. 2014). It is also a case of issues of statistical significance and the politics thereof, particularly what I shall call the temporal politics, and to a lesser extent the social politics, of eye tracking. In relation to the latter, many eye tracking studies recruit students as participants (e.g. Tatler et al. 2010; Võ et al. 2012). As a result, the findings might pertain not universally, but to population members who are of a certain age and, if we can say that university students tend to come from more affluent backgrounds, of a certain socioeconomic status.
In relation to statistical significance, meanwhile, all studies tend to discount those viewers who do not look where the researchers want them to look; for example, in a study of where people look when viewing moving faces, only 87 per cent of fixations targeted the face region when participants were shown a moving face with sound, with that figure dropping to 82 per cent for a moving face without sound (Võ et al. 2012, 7). Of course, when what one is investigating is where people look when they look at faces, it is correct to discount those 13-18 per cent of fixations that were not directed at the face. But the point is that similar discounts happen all the time, not least in the process of averaging that we see in various experiments, including those mentioned by Marchant et al., Hasson et al., and Smith. And yet, where neuroscience is based in large part upon the study of anomalous brains – from autistic individuals to brain-damage patients to perceived geniuses – psychologists engaged in eye tracking tend to go with force majeure and report the average, or what most people do. There may, however, be in human populations a ‘long tail’ (to use the terminology of Chris Anderson, 2006) that may not in any one experiment be statistically significant, but which over a number of experiments might begin to show patterns that could help us to understand vision and attention in a more ‘holistic’ fashion.

To continue by way of another anecdote: a film scholar took part in an eye tracking film study at a leading European university. Upon completion, the colleague conducting the study told the scholar that they had looked in completely different places – generally at the margins of the screen – from where most of the other participants looked, and that their participation was therefore useless to the study. If we can say that the film scholar looked (perhaps deliberately) where others do not look, then to what degree is film viewing a matter of, to use Turvey’s language, management and engagement? That is, do film scholars look differently at films, perhaps even at the world? And if so, what can we make of this?

The Russian ‘godfather’ of eye tracking studies, Alfred Yarbus, famously demonstrated in the 1960s that setting viewers different tasks will completely modify where they look at an image (Yarbus 1967; see also Tatler et al. 2010). There is much to extrapolate from this. For while eye tracking studies will use terms like ‘naïve’ to signal that participants are unaware of the aims of the study, when it comes to film viewing, humans are rarely naïve at all. Advertising, reviews and other publicity materials are always – at least on an implicit level – telling us how and where to look at films, just as the media and our conspecifics are telling us how and where to look in the real world. Now, it may well be that humans who have never before seen a movie have little trouble understanding Hollywood cinema, as affirmed, inter alia, by both Messaris (2012, 31-33) and Smith (2012b, 74). Nonetheless, our attention is not just managed and engaged in the cinema; it is also managed and engaged for the cinema. I have not read any studies in which psychologists showed a non-Hollywood film to first-time audiences and in which those audiences had trouble understanding the film; that is, such studies affirm nothing about the comprehension of continuity editing per se, although they might affirm that humans can understand cinema without training – as is presumably affirmed worldwide every day, since the first film shown to many children is not a Hollywood film but a Bollywood, Nollywood, Filipino, Chinese or other movie. What is more, the studies perhaps only affirm the cultural hegemony enjoyed by Hollywood, in that psychologists present a Hollywood and not another film to those first-time viewers – and then use that research to affirm Hollywood’s economic primacy as a result of its filmmaking style, and not also as a result of historical and other factors. As Cynthia Freeland reminds us in her response to Smith’s AToCC, James Peterson in Post Theory argued that

a common feature of avant-garde film viewing – one that usually passes without comment: viewers initially have difficulty comprehending avant-garde films, but they learn to make sense of them. Students who take my course in the avant-garde cinema are at first completely confused by the films I show; by the end of term, they can speak intelligently about the films they see. (Peterson 1996, 110; quoted in Freeland 2012, 41)

In other words, as per my assertions re: The Room above, it is quite possible that humans would quite easily watch – and enjoy – all manner of different films, but that they do not because their attention is not ‘managed and engaged’. Again, this is a political issue because, if it is true, then it is about who can afford to use the mass media to manage and engage the attention of the most people in the quest for profit – meaning that alternative approaches to filmmaking are forced either to adopt the same system of filmmaking in order to compete, or are pushed to the margins, where they struggle to find audiences because people are not prepped to watch them. The scholar at the European university has had a long education in film, and this potentially manages and engages differently how they attend to films; their ‘statistically insignificant’ response might well be important in helping to demonstrate how we can not just view different/marginal films, but also view mainstream films differently.

Cutting and colleagues suggest that film editing correlates with a 1/f pattern, with 1/f (‘one over frequency’) referring to a fluctuation pattern said to characterize the ‘natural’ rhythm with which humans attend to objects in the real world (Cutting et al. 2010). In other words, the suggestion is that Hollywood editing rhythms reflect human attention spans – ‘evolving toward 1/f spectra… [meaning that] the mind can be “lost”… most easily in a temporal art form with that structure’ (Cutting et al. 2010, 7). Now, since David L. Gilden and colleagues only identified this 1/f structure in human cognition in 1995 (Gilden et al. 1995), it remains untested (and untestable without a time machine) whether the human attention span itself changes over time, or according to culture. That said, if cinema has always been going at about the pace at which human attention was working, and if cinema cutting rates have accelerated from the 1930s through to the present era, then attention spans may well interact with culture, and even be shaped by our media.
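For readers curious about the mechanics behind such a claim, the kind of analysis involved can be sketched in a few lines of code. To be clear, this is a minimal illustration and not Cutting et al.’s actual procedure or data: it builds a synthetic series constructed to have an exact 1/f power spectrum, then estimates the log-log slope of that spectrum – a slope near −1 being what ‘1/f’ means. Cutting et al. applied spectral analysis of this general kind to measured shot-length sequences from real films.

```python
import cmath
import math
import random

def power_spectrum(xs):
    """Naive DFT periodogram: (frequency, power) at each positive frequency.

    O(n^2), which is fine for a short series such as a list of shot lengths.
    """
    n = len(xs)
    mean = sum(xs) / n
    xs = [x - mean for x in xs]          # remove the DC component
    spectrum = []
    for k in range(1, n // 2):           # positive frequencies only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(xs))
        spectrum.append((k / n, abs(coeff) ** 2 / n))
    return spectrum

def loglog_slope(spectrum):
    """Least-squares slope of log(power) against log(frequency)."""
    pts = [(math.log(f), math.log(p)) for f, p in spectrum if p > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    return (sum((x - mx) * (y - my) for x, y in pts)
            / sum((x - mx) ** 2 for x, _ in pts))

# A synthetic series with an exact 1/f spectrum: each frequency k
# contributes a cosine of amplitude k**-0.5 (power proportional to 1/k)
# at a random phase. A real analysis would use measured shot lengths.
random.seed(42)
n = 512
phases = {k: random.uniform(0, 2 * math.pi) for k in range(1, n // 2)}
series = [sum(k ** -0.5 * math.cos(2 * math.pi * k * i / n + phases[k])
              for k in range(1, n // 2))
          for i in range(n)]

slope = loglog_slope(power_spectrum(series))
print(f"log-log spectral slope: {slope:.2f}")   # close to -1 by construction
```

A white-noise series would instead give a slope near zero; the interest of Cutting et al.’s finding, as quoted above, is that Hollywood shot-length series have drifted over the decades towards the 1/f end of this range.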

I often ask my students how long they should look at a painting for. It’s a trick question, because of course there is no right or wrong answer. It is my (untested) hypothesis, however, that the amount of time humans look at paintings has been shaped by the media, including films; that is, in galleries, I see people look at paintings for about the average duration of a film shot (four to five seconds) – although recently they have begun to look at a painting for about the amount of time that it takes them to take a photo of that painting with their mobile handheld device.2 Smith, citing Cutting’s work, suggests that

[i]n an average movie theatre with a 40-foot screen viewed at a distance of 35 feet, this region at the centre of our gaze will only cover about 0.19 per cent of the total screen area. Given that the average shot length of most films produced today is less than 4 seconds… viewers will only be able to make at most 20 fixations covering only 3.8 per cent of the screen area. (Smith 2013: 168)

Given that paintings vary in size, one cannot rightly say how long it would take to see a ‘whole’ painting. But if one looks at a cinema-screen-sized painting for 4 seconds, then one would, after Smith, fixate on about 4 per cent of that painting. In order to see the whole painting, more time is needed, just as more time is needed to take in our natural, rather than cinematic, environment, since we also only ever see a small proportion of that at any one time.

Relating the foregoing foray into painting back to film, we might add that, given that we do not take in visual information while saccading, and given that saccades have a duration of 20-50 milliseconds (Smith 2013, 168), we do not take in visual information for about 0.7 seconds during every four-second shot. In a film of 90 minutes there are on average 1,350 shots, meaning that we do not take in visual information for 15 minutes and 45 seconds per film – blinks and turning away from the screen for snogging and toilet breaks not included. If spatially we only see 3.8 per cent of the screen during a shot, and if we only see 82.5 per cent of a film’s duration, this means that we see around 3.14 per cent of the average (Hollywood) film (no spooky π references intended).3 To be clear, these statistics apply not just to Hollywood: I would only see 3.14 per cent of Empire if I were to watch it at the cinema, too. But since it is a film comprised of a single-seeming shot and a static frame, Empire clearly encourages viewers to look for longer at the space within the frame, while Hollywood arguably does not give viewers the time to do so, since the content and duration of its images are concerned uniquely with storytelling, and not with anything else. This in turn affects how long we think that we are supposed to look at objects in our everyday lives, if for the sake of argument my gallery hypothesis be allowed to stand. Neither paintings, nor Empire, nor the world itself is organized to be seen ‘cinematically’, even if Empire is undoubtedly a work of cinema. That is, they all invite contemplation, but what they often receive is a shot-length of attention before they become boring (Empire perhaps deliberately so). Neither paintings, nor Empire, nor large swathes of the world itself controls our attention in the way cinema does; there would be much more idiosyncrasy and less synchrony of attention when looking at Empire than at a mainstream film.
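The arithmetic behind these percentages can be verified with a short back-of-envelope calculation. This is only a sketch of the reasoning above, not part of the original argument: it assumes one saccade per fixation, and takes 35 ms (the midpoint of the 20-50 ms range cited) as the average saccade duration.

```python
# Back-of-envelope check of the "3.14 per cent" figure.
# Assumptions: 35 ms per saccade (midpoint of 20-50 ms), one saccade per fixation.
shot_length = 4.0           # seconds (average shot length cited above)
fixations_per_shot = 20     # maximum fixations per shot (Smith 2013)
saccade_duration = 0.035    # seconds, assumed midpoint of the 20-50 ms range
film_length = 90 * 60       # seconds in a 90-minute film

shots_per_film = film_length / shot_length                    # 1,350 shots
blind_time_per_shot = fixations_per_shot * saccade_duration   # ~0.7 s per shot
blind_time_per_film = shots_per_film * blind_time_per_shot    # ~945 s = 15 m 45 s

temporal_coverage = 1 - blind_time_per_film / film_length     # ~0.825 (82.5%)
spatial_coverage = fixations_per_shot * 0.0019                # 20 x 0.19% = 3.8% of screen
total_coverage = spatial_coverage * temporal_coverage         # ~0.0314, i.e. ~3.14%
```

Multiplying 3.8 per cent spatial coverage by 82.5 per cent temporal coverage gives 3.135 per cent, which the text rounds to 3.14.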
If the proliferation of screens featuring cinematic techniques is the making-cinematic of reality in the service of capital, then the refusal to attend to paintings, Empire and the world itself suggests not just that our attention is controlled while watching a film, but that our attention is working at a ‘cinematic’ rhythm – a rhythm that Empire uses the very apparatus of cinema in order to try to break.

The ‘temporal politics’ that I mentioned above, then, is to do with the management and engagement of attention rhythms/patterns not just in cinema, and not just for cinema (we are prepped to be movie viewers), but also by cinema for the world (people pay attention to paintings in galleries for about as long as they would attend to a film shot/as long as a film shot would allow them to attend to it, before ‘cutting’, or turning away, likely getting out one’s phone, the screen of which one can also cut across with the swipe of a thumb). Politics rears its head again as homogeneity of attention span, perhaps even of life rhythm, jumps into bed with the political and economic concerns that govern the structures of our society. Almost certainly in an unwitting fashion (this is not a conspiracy), validating certain cutting rates and attention spans over others becomes an issue linked to social control, and to the economic bottom line of both cinema and perhaps society as a whole. Eye tracking studies of film are part of this political ecology.

A final throw of the dice. Those of us engaged in education are of course part of a system that prepares our students for the real world. But I am personally also committed to encouraging my students, sociably and communicatively, to develop their individuality, to become ‘idiosyncratic’, to look at the world differently, and various other notions that have long since been disingenuously corporatized as advertising slogans. Being a film teacher, I do this by encouraging my students to look differently at films. Hollywood films employ techniques that do not encourage us to look differently at movies; instead, our attention (and our brain activity) is synchronized. What is more, the idiosyncratic viewers who do look at films differently (the European film scholar) are discounted from eye tracking studies for not conforming to the norm (for not confirming to us what we already know, even if not through a scientific framework). Not only might we encourage our students to look at the world differently (to become the idiosyncratic, perhaps ‘educated’ viewer), but we might also encourage our students to make films differently, since films can also play a role in encouraging us to see the world differently, to become ‘idiosyncratic’ individuals (Hasson’s research involved the production of an interesting avant-garde work, regardless of his own thoughts on the matter). Perhaps eye tracking (and fMRI) studies can help in this by turning their attention not to the majority, but to the minority, to the marginal people who look, both figuratively and literally, at the margins of the screen, and at marginal films. And this perhaps involves slowing attention down, and making it (willfully?) deeper rather than rapid and superficial. I know that the longer I look at a painting, the more the power of its creation comes to my mind, the more I marvel at it and also at the world that sustains it. In other words, it brings me joy.
As I repeat often to those students who do not seem committed to participating in my classes: the more you put in, the more you get out.

Would educating (managing and engaging attention), both in the classroom and through making and showing different sorts of (slower?) films, not simply replace one trend with another, and itself be prey to political issues regarding what type of ‘idiosyncrasy’ is best? Of course, such questions will be of ongoing importance and would need constant attention. In relation to eye tracking film studies, though, the introduction of a ‘temporal’ dimension might help enrich our understanding of idiosyncrasy. The spatial information that idiosyncratic eye-tracks give us is chaotic and without pattern – and thus of not much use to the psychologist; however, there may well be temporal patterns that emerge when we consider ‘idiosyncrasy’ as a shared process (to be encouraged?), rather than as a reified thing to be commoditized.

Paul Willemen has written about cinephilia as being the search for/paying attention to otherwise overlooked details in movies (Willemen 1994, 223-57). Meanwhile, Laura Mulvey has argued that DVD technology allows the film viewer to develop a deeper, cinephilic relationship with movies, since she can now pause and really analyse a film – by ‘delaying’ it/slowing it down (Mulvey 2006, 144-60). To look idiosyncratically at a movie is thus to look ‘cinephilically’; it is to look at cinema with love, perhaps to look with love tout court – but in this instance at cinema. My argument comes full circle, then, as we bring cognitive film theory, via eye tracking film studies, into contact with film theory/film-philosophy, exemplified here by Mulvey as a major figure from the Screen movement/moment. There is no I in eye tracking – but if we can accept that eye tracking studies of cinema are embroiled in a political discourse (and a political reality) concerning which films are validated as better than others and why, then perhaps by putting an ‘I’ into eye tracking, by looking at the idiosyncratic in addition to the statistically significant, we may be able to bring about different ways of seeing and making films.


  1. The exception is Dancer in the Dark (Lars von Trier, Spain/Argentina/ Denmark/Germany/Netherlands/Italy/USA/UK/France/Sweden/Finland/ Iceland/Norway, 2000).
  2. One of my peer reviewers took issue with the speculative nature of this suggestion. The other agreed with it.
  3. Note that I insist on the term ‘visual information’ – since film does not just engage us visually, but also aurally and via other senses (as Freeland, 2012, also reminds Smith in her response to his AToCC essay).



Anderson, Chris. 2006. The Long Tail: Why the Future of Business is Selling Less of More. New York: Hyperion.

Beller, Jonathan. 2006. The Cinematic Mode of Production: Attention Economy and the Society of the Spectacle. Lebanon, N.H.: Dartmouth College Press.

Bordwell, David. 2010. “Now you see it, now you can’t.” Observations on Film Art: Kristin Thompson and David Bordwell, June 21.

Bordwell, David, and Noël Carroll. 1996. Post-Theory: Reconstructing Film Studies. Madison: University of Wisconsin Press.

Cutting, James E. 2014. Peer Reviewer’s Comments. Received October 1.

Cutting, James E., Jordan E. DeLong and Christine E. Nothelfer. 2010. “Attention and the Evolution of Hollywood Film.” Psychological Science 20:10, 1-8.

Freeland, Cynthia. 2012. “Continuity, Narrative, and Cross-Modal Cuing of Attention.” Projections: The Journal for Movies and Mind 6:1, 34-42.

Gilden, D.L., T. Thornton, and M.W. Mallon. 1995. “1/f Noise in Human Cognition.” Science 267:1837-39.

Hasson, Uri, Ohad Landesman, Barbara Knappmeyer, Ignacio Vallines, Nava Rubin and David J. Heeger. 2008. “Neurocinematics: The Neuroscience of Film.” Projections: The Journal for Movies and Mind 2:1, 1-26.

Marchant, Paul, David Raybould, Tony Renshaw and Richard Stevens. 2009. “Are you seeing what I’m seeing? An eye tracking evaluation of dynamic scenes.” Digital Creativity 20:3, 153-163.

Messaris, Paul. 2012. “Continuity and Its Discontents.” Projections: The Journal for Movies and Mind 6:1, 28-33.

Miller, Greg. 2014a. “How Movies Manipulate Your Brain to Keep You Entertained.” Wired, August 26.

Miller, Greg. 2014b. “How Movies Synchronize the Brains of an Audience.” Wired, August 28.

Mulvey, Laura. 2006. Death 24x a Second: Stillness and the Moving Image. London: Reaktion.

Peterson, James. 1996. “Is a Cognitive Approach to the Avant-Garde Cinema Perverse?” In Post-Theory: Reconstructing Film Studies, edited by David Bordwell and Noël Carroll, 108-129. Madison: University of Wisconsin Press.

Rogers, Sheena. 2012. “Auteur of Attention: The Filmmaker as a Cognitive Scientist.” Projections: The Journal for Movies and Mind 6:1, 42-49.

Saez de Urabaín, Irati R., Mark H. Johnson and Tim J. Smith. 2014. “GraFIX: A semiautomatic approach for parsing low- and high-quality eye tracking data.” Behavior Research Methods, March 27, pp. 1-20.

Sconce, Jeffrey. 1995. “‘Trashing’ the academy: taste, excess, and an emerging politics of cinematic style.” Screen 36:4, 371-393.

Smith, Greg M. 2012. “Continuity Is Not Continuous.” Projections: The Journal for Movies and Mind 6:1, 56-61.

Smith, Tim J. 2012a. “The Attentional Theory of Continuity Editing.” Projections: The Journal for Movies and Mind 6:1, 1-27.

Smith, Tim J. 2012b. “Extending AToCC: A Reply.” Projections: The Journal for Movies and Mind 6:1, 71-78.

Smith, Tim J. 2013. “Watching You Watch Movies: Using Eye Tracking to Inform Cognitive Film Theory.” In Psychocinematics: Exploring Cognition at the Movies, edited by Art P. Shimamura, 165-191. New York: Oxford University Press.

Smith, Tim J. 2014. “Audiovisual Correspondences in Sergei Eisenstein’s Alexander Nevsky: A Case Study in Viewer Attention.” In Cognitive Media Theory, edited by Ted Nannicelli and Paul Taberham, 85-105. London: Routledge/American Film Institute.

Smith, Tim J, and John M. Henderson. 2008. “Edit Blindness: The relationship between attention and global change blindness in dynamic scenes.” Journal of Eye Movement Research 2(2):6, 1-17.

Tatler, Benjamin W., Nicholas J. Wade, Hoi Kwan, John M. Findlay and Boris M. Velichkovsky. 2010. “Yarbus, eye movements, and vision.” i-Perception 1:7-27.

Treuting, Jennifer. 2006. “Eye Tracking and the Cinema: A Study of Film Theory and Visual Perception.” SMPTE Motion Imaging Journal 115:1, 31-40.

Turvey, Malcolm. 2012. “The Continuity of Narrative Comprehension.” Projections: The Journal for Movies and Mind 6:1, 49-56.

Võ, Melissa L.-H., Tim J. Smith, Parag K. Mital and John M. Henderson. 2012. “Do the eyes really have it? Dynamic allocation of attention when viewing moving faces.” Journal of Vision 12(13):3, 1-14.

Wass, Sam V., Tim J. Smith and Mark H. Johnson. 2013. “Parsing eye tracking data of variable quality to provide accurate fixation duration estimates in infants and adults.” Behavior Research Methods 45:1, 229-250.

Willemen, Paul. 1994. Looks and Frictions: Essays in Cultural Studies and Film Theory. Bloomington: Indiana University Press.

Yarbus, Alfred L. 1967. Eye Movements and Vision. Translated by Basil Haigh. New York: Plenum Press.


William Brown is a Senior Lecturer in Film at the University of Roehampton, London. He is the author of Supercinema: Film-Philosophy for the Digital Age (Berghahn, 2013) and, with Dina Iordanova and Leshu Torchin, of Moving People, Moving Images: Cinema and Trafficking in the New Europe (St Andrews Film Studies, 2010). He is the co-editor, with David Martin-Jones, of Deleuze and Film (Edinburgh University Press, 2012). He is also a filmmaker.

Our Sherlockian Eyes: the Surveillance of Vision – Sean Redmond, Jodi Sita and Kim Vincs


For this interdisciplinary article, we undertook a pilot case study that eye-tracked the ‘Holmes Saves Mrs. Hudson’ sequence from the episode A Scandal in Belgravia (Sherlock, BBC, 2012). This small-scale empirical study involved a total of 13 participants (3 males and 10 females; mean age 27 years), comprising a mixture of academics and undergraduate students at La Trobe University in Melbourne, Australia. The article examines its findings through a range of threaded frames – neuroscience, forensics, surveillance, haptics, memory, performance-movement, and relationality – and uniquely draws upon the interests of the authors to set the examination in context. The article is both a reading of Sherlock and a dialogue between its authors. We discover that the codes and conventions of Sherlock have a direct impact on where viewers look, but we also discover eyes emerging in the periphery of the frame, and we account for these ways of seeing in different ways.

My Sherlockian Eyes

Sean Redmond

I have always been fascinated, perhaps even obsessed, with my eyes. I have often felt them looking into things, as if they had their own embodied consciousness that I was entirely, simultaneously, conscious of. It was as if we, my eyes and I, saw the world separately and together, possessing a double vision, one set within the meaty windows of my sockets, and the other looking outside, grasping the world with a replete hapticity, sending shivers across my pupils and retinas as they did so.

I have found myself trying to catch my eyes out, to second guess their movements, their sightlines, and their interests. I must be a sight for sore eyes on the rush hour train, wrestling with what I will allow my eyes to see. I often try to resist my conforming eyes, to make them look towards the cultural periphery, to the aesthetic margins, and to the haphazard shards of broken, refracted light on oily windows that few others see as they go about their busy, and sometimes dreary lives. I have a deep yearning to see my eyes politicised, to turn them completely into organs of touch (Marks, 2000), and to feel them wander freely across the intricate layers of the film and television screen. I want Sherlockian eyes.

I have held a rather romantic notion about my viewing eyes, and the eyes of some viewers: that they sometimes wander freely across the spaces, objects, lights, colours, bodies, movements and sounds of the diegetic world they are presented with. Narrative action may be centre frame, and all the elements of the mise en scène may be attempting to draw one’s eyes to this interaction, but I will catch myself looking to the far left of the screen, to hold my sight on an obscure pattern on a wall, or to search for the origins of a distant minor or insignificant sound just off-screen. I want to see inside and outside the narrative simultaneously. I imagine my eyes as Sherlock-like, searching for narrative clues, new plot developments, and for the sensuous expression of character, mood and feeling. But I also see them loosing or freeing themselves; my eyes (unconsciously) float within all the elements of filmic or televisual material as they happen on the screen.

I see in Sherlock’s eyes this double vision: the ability to have foresight, to see into the margins of things, and to be consciously aware of the vision within and all around him. As Sherlock sees into the finest grain of things, so do my eyes and I. My Sherlockian eyes are forensic, haptic, self-processing and are blessed with twenty-twenty vision – they have the power to see into all things clearly. Sherlock mirrors, or rather embodies, the very qualities of the cinema machine (Metz, 1982), and of the surveillance regimes (Foucault, 1977), that emerged at the time the first Sherlock Holmes books were written (1887-1927). Sherlock is a text that already embodies the eye tracking experience.

But is this so, or just a fictive longing? What evidence do we have that our eyes do what we say they do? What evidence do we have that viewers possess a double vision? This romantic, phenomenological notion of the viewing, carnal, haptic eyes, then, we wanted to test, to explore, to see in action and interaction…

The Science of the Sherlockian Eye

Jodi Sita

I can often be found staring off into space, deep in thought, looking at nothing in particular. If I were being eye tracked it would look like I was staring at something. Where people look and, more particularly, why they are looking there are questions that fascinate me and make me think about the phenomenon of blank stares. Human beings have fascinating eyes that, because they are housed independently, with their own localized environments, need to and can move about quite a lot. A tiny spot on the retina at the back of the eye, the fovea, houses the receptors for high visual acuity, and this spot must be directed at the object we want to see in order for us to see it clearly, and see its fine detail. People tend to move their eyes to aspects of a scene which are interesting or useful. The visual system in the brain directs these movements; they are not random. Bottom-up control processes (see Itti & Koch, 2000) help direct some shifts in visual gaze and involve features that are thought to attract attention because they are conspicuous. These include salient features such as luminosity, colour and movement. More importantly, in the human gaze we know that top-down processes are at play when viewing complex or meaningful scenes; our eyes employ feature selection that is based on our understanding of the scene and our internal expectations about where important things are or are likely to occur (Torralba et al., 2006; Birmingham et al., 2008; Vincent et al., 2009).

What I am curious to learn more about is how our viewing behavior is shaped by what we are doing, by how we are interacting with the world, and by how our brains are responding to and shaping that encounter. Thus, my involvement in this work comes from these curiosities; however, it also stems from my own forensic tendencies. It develops from my own need to ask the Sherlockian questions about viewers viewing Sherlock.

My early research had me investigate a branch of forensic science: handwriting and signature examination. At the time the area had a lot of practitioners, many quite experienced and successful, and there had been a substantial amount written about it, yet very little objective evidence for the field’s claims had been produced. What the field needed were studies which produced hard evidence to support or dispute its original claims. My work was part of a large and ongoing body of work in which the field’s ideas and claims are tested objectively, and whose results can be used as evidence to support existing notions or derive new ones. It was within this area of research that I started using eye tracking, yet it has also led me to want to bring eye tracking to this moving-image field, where the focus of the eye tracker, shining like an objective lens over some of the theories of the area, can help bring to it another method for its practitioners to use – to examine how viewers watch and are involved with what they are watching.

The Optometry of Sherlock

Kim Vincs

The science of the gaze—of how eyes fixate or fail to fixate—has always been of great interest to me, first in my original career as an optometrist, and latterly as a choreographer and then a transmedia dance artist. As an optometrist, I was less concerned with where people looked than with whether they could look, and with the accuracy and resolution of the sensory information they received and interpreted when they did look. What do I mean by this? In considering whether people could look, I am referring to whether they were able to accurately fixate the static and moving targets they wanted and needed to. Fixating, that is, aiming one’s eyes at a static or moving target, is a function of attention, and is integrated as action by the sensory and muscular systems of the eye and brain. There are many pathological conditions that interfere with the capacity to fixate a static visual stimulus quickly, accurately and efficiently. As an optometrist, I was primarily concerned with detecting these conditions and referring patients who had them for appropriate treatment. I was, in a very real sense, perfectly happy to allow my patients to decide for themselves what to fixate on. My job was simply to ensure that, should they wish to, they would be capable of locating and tracking something. This willingness to allow dissociation between capacity and will, between ability and decision, is something I consider foundational to the ways in which I have pursued my subsequent research into creative practices. I have never, as a choreographer or an interactive / transmedia artist, wished to dictate to people where they should look or what they should perceive. I consider my job to be to place appropriate objects / events / movements within a context in which they can be perceived should people so choose.

This outlook has had some specific implications for my art practice. As a choreographer, I have never thought to ask what someone watches when they observe a dancer moving. Cognitive psychologist Kate Stevens’ seminal work on eye movements in dance has demonstrated a classic novice/expert shift in the way that observers view dance. As in many other fields of expertise, such as flying aircraft or instructing drivers, experts make significantly fewer saccades, that is, changes in fixation, when watching a dance performance than do novices, where experts are people with professional experience in dance and novices are people with no particular prior experience of the artform (Stevens et al., 2010). The implication of these results is that experts do not need to change fixation as many times as novices because they are able, to some extent, to predict where the dancing body will move. In essence, they understand what they are looking for, and are therefore able to maximize the efficiency of their fixation choices.

What Stevens’ work does not tell us is which movement features most attract fixation when watching a dancing body. My own work in motion capture analysis of dance movement provides me with a theory about why this might be a difficult thing to measure. Dance, at the movement level, comprises movement of some 33 major joints, each of which may make movements of entirely different velocity, acceleration and magnitude to achieve an overall aesthetic effect. The dancing body essentially has no ‘centre of focus’ that can be interpolated from movement data such as the speed, momentum or even position of specific body parts, because the semantics of dance movement are only meaningful in relation to the composition across the body. As I have argued previously (Vincs, 2014), the semantic significance of a movement bears no relationship to its metrics, such as amplitude or speed. In some aesthetic contexts, tiny movements of the fingers may be essential to the meaning and feeling tone of the movement. In others, such as large virtuosic or acrobatic movement forms, hand gestures may contribute relatively little to a movement’s significance.

Dance grammars are aesthetically and culturally, rather than anatomically, determined. I think that this fact has also contributed to my attraction to the notion of Sherlockian eyes. As a choreographer, I am always a detective, seeking potential significances in movement rather than predetermined ones. I value the opportunity to go looking for the dancing body, browsing, shuffling, wandering through the multiple and complex joint actions that comprise a single ‘step,’ looking for something of newness and emotional value rather than assuming I know what it is and where I will find it. Yet I am always aware that my aesthetic search is underpinned by a neurosensory apparatus that is primed to respond selectively to human movement (Hagendoorn, 2004; Vincs, 2009). I am therefore armed, at least potentially, with an inherent ‘grammar’ that is defined by the morphology and physical capacity of the human body, and I am curious as to what predilections and biases my visual sensory system imposes on my seemingly adventurous gaze.

What now follows is an exploration of our different approaches to the eye tracking data that we generated. Jodi is first, outlining our empirical method and undertaking a close reading of the preliminary results. Jodi shows how the results begin to tell us that the viewers’ gaze patterns and fixations are closely clustered together, and she situates these findings in relation to the science of the eye, the importance of the face in human communication, and the visual and narrative codes and conventions of Sherlock. Sean then explores the results in terms of haptic visuality and the surveillance gaze, drawing upon phenomenology and the discourses of conspiracy to argue that vision in Sherlock is marked by touch, texture, and control. Kim examines the results in terms of movement and relationality, examining the eye tracking data in terms of the way it supports and confirms the necessary nature of vision in seeing into moving things. Kim shows that even though there is a high degree of direction in terms of where viewers are being asked to look, visual perception allows or enables the eyes to wander. Finally, we conclude our article together, drawing together our voices to offer an interdisciplinary way forward.

Eye Tracking Sherlock (the objective viewing): Methods and Preliminary Results

Jodi Sita

We undertook a pilot case study that eye-tracked the ‘Holmes Saves Mrs. Hudson’ sequence from the episode A Scandal in Belgravia (Sherlock, BBC, 2012). This small-scale empirical study involved a total of 13 participants (3 males and 10 females; mean age 27 years), comprising a mixture of academics and undergraduate students at La Trobe University in Melbourne, Australia.

A Tobii X-120 remote eye tracker (Tobii Technology, Stockholm, Sweden) was used to record participants’ eye movements. The tracker has an accuracy of 0.5º of visual angle and allows a moderate amount of free head movement (30 × 22 × 30 cm (width × height × depth) at 70 cm). This data collection technique uses infrared light reflected from the eye to determine each participant’s gaze position, and allows for natural head movements and natural human responses to screened material. The eye tracker was connected to a PC with an Intel® Core™ i7 CPU and a ‘Cool Master’ hard drive, running Tobii Studio 2.3.2 professional edition software for the presentation of the movie scene stimuli and the recording of eye movements. The eye tracker was set up on a desk, situated below a Dell PC monitor (1680×1050), which the participants used to view the Sherlock sequence. Participants were seated on a sturdy chair 55-65 cm away from the eye tracker and 65-75 cm from the viewing screen. A second screen (Dell; 1920×1080) was used by the researchers to view, in real time, the eye movements of the participants as they were being tracked and calibrated, although all computer analyses and statistics reported here were based on stored data.

Participants were recruited via posters advertising the study at La Trobe University, with ethics approval (ethics approval number: FHEC13/101). Participants were required to be at least 18 years of age to be eligible. People who expressed interest in taking part in the study were contacted via email to attend a single recording session. In preliminary tests participants were introduced to the study and screened for exclusion criteria, such as taking medications (e.g. benzodiazepines), or having known neurological conditions, disorders or injuries, that could potentially affect their eye movements. All participants were screened for normal or corrected-to-normal near visual acuity of N8 or better on the Designs for Vision near visual acuity test, and with a pen-torch eye movement excursion test to screen for symmetrical movement of the eyes. Participants who were ametropic were allowed to wear their glasses to watch the stimuli.

Prior to eye movement data being collected, the eye tracker was calibrated for each participant using a 9-point on-screen calibration test within the Tobii Studio recording software. Participants were told only that they would watch short segments from a variety of films. Recording sessions typically lasted 15-25 minutes, and each participant was tested individually.

First, we found that our viewers’ eyes were strongly drawn to follow movement and directional cues and signs, including camera and character movement. In the opening scene, where Mrs. Hudson’s fingers scrape along the wall, followed by Sherlock’s fingers retracing her steps (3-10 seconds), we see all viewers making strings of successive fixations, each following these finger movements (see Figures 1 and 2). The sound of these fingers scraping along the wall was heavily amplified, and fully synchronised, and we suggest, then, that sound was also an aesthetic device being employed to direct where viewers looked. These results confirm previous findings in which camera movement, sound, character behavior, and editing patterns are seen to inform gaze patterns and fixations (see Smith, forthcoming; Smith and Mital, 2011).

In one brief shot in the middle of this scene, we cut to a close-up of Mrs. Hudson’s face, full of anguish. All the subjects discern this face amongst the movement and chaos of the surrounding action, as seen by their fixations on its features. Her face is captured in the center of the screen, making it central to the scene’s visualisation. However, the face is known to be a strong attractor of the eye’s attention (Treuting, 2006), and, being such an important narrative component in this scene, it would have been a strong attentional cue.

Figure 1: Finger drag: 2 subjects

Figure 2: Finger drag: 13 subjects

Second, we observed an alignment in vision with regard to where Sherlock was looking. This sight co-proximation is referred to as ‘joint attention’, in which what one attends to seems to shift automatically to where another is looking (Birmingham et al., 2008). Interestingly, this is a common misdirection trick used by magicians (Kuhn et al., 2009).

In particular, we observed that Sherlock’s point of view in the scene very often produced a close proximity in viewers’ focus and attention (participants looked where Sherlock looked, and with the same overall gaze patterning; see 1:05 to 1:14). This also supports the findings reported by other film scholars using eye tracking methods, such as Rassell et al. (forthcoming, 2015), who found that a character’s point of view and subjective experiences have an influence on where viewers look.

The trends for this short sequence support the idea that Sherlock is a character-driven drama in which his vision is not only foregrounded but given omnipotent and omniscient power. Thus, viewers are positioned not only to observe from his authorial position but to trust where he looks and what he discovers there. There are also recognizable genre codes and conventions in play, structuring the looking patterns we have observed. This is a detective-thriller series that repeats a series of camera and editing motifs that become familiar to audiences (Neale, 1990).

Figure 3: A heat map showing the hot spots where viewers gazed; red indicating longer dwell time

Figure 4: The gaze plots showing the sequence of looks that viewers made over Sherlock’s face

Third, we found that viewers focused heavily on the characters’ faces, both in scenes with dialogue and those without. In scenes where Sherlock was clearly putting together the evidence, viewers focused heavily on his eyes, dwelling there for almost the entire shot (Figures 3 and 4).

Viewers fixated back and forth between the eyes, face and mouth of the central characters. These viewing patterns are characteristic of the movements made in facial and emotional recognition (Ekman & Friesen, 1971; Hernandez et al., 2009) and give some indication that viewers were paying attention to the different characters in the scene, working out the role of each character and what their intentions and emotions might be. These patterns of eye movements suggest that viewers are engaging with the scene as they would in a normal face-to-face encounter, using eye movements to verify who people are and what they are feeling. It is interesting to note that people who are not able to perform these socially informative tasks, such as those with schizophrenia or some traumatic brain injuries, do not show the same eye movement behaviors (Watts & Douglas, 2006; Loughland et al., 2002; Williams et al., 1999).

Our viewers clearly followed narrative cues in line with the dialogue exchanges, looking back and forth between the characters’ interpersonal relays (Figures 5, 6 and 7). These results are similar to those of Treuting (2006), who eye-tracked 14 participants viewing short clips from films such as The Shawshank Redemption (1994) and Harry Potter and the Philosopher’s Stone (2001). Treuting found that gaze clusters emerged in and around the faces of the central characters involved in dialogue and in moments of heightened drama (see also Redmond and Sita, 2013).


Figure 5: Single viewer character alignment

Figure 6: 6 subjects and the relay of looks on eyes, mouths and faces.

Figure 7: Final scene, last 12 seconds, searching for information: 13 subjects (most of the fixations fall over the faces of the 2 central characters in dialogue)

Fourth, we saw evidence that viewers searched for narrative information and cues: this included fixating on aspects of the background wall before Sherlock first enters the scene (0:33–0:36), then moving between the image of a smile seen on the wall and Holmes’ face, and spending time ‘reading’ the shop window signs and the note on the front door (Figure 8) as Watson arrives at the scene (2:22 to 2:37). One can understand such scanning as influenced by the meticulous work of the mise en scène, in which all the elements have been carefully placed to enact this type of searching for narrative cues (see Smith, forthcoming).


Figure 8: Searching for narrative information

Finally, and related to our last point, we observed that certain viewers looked at more elements of the mise en scène (Figure 9 shows gaze patterns for 4 of the 13 viewers), including the interior lights, the computer and the furniture, even as the more dramatic moments of the scene were taking place.

These findings were interesting but not totally unexpected; we would hope that not all people viewing the same scene would watch it in the same way (something discussed further below). Insights like this allow us to see that even though some aspects, such as faces and movement within a scene, are strong attention grabbers, other aspects can captivate and draw attention away from those areas of interest. For example, the scene shown below (Figure 9) involves a particularly emotive exchange between two key characters, Mrs. Hudson and Dr. Watson. The fact that 4 of the viewers were attending elsewhere helps us to see the aspects of interest at play outside the main narrative. Why certain viewers look to the margins of the screen, to the more ‘insignificant’ elements of the mise en scène, remains of great interest. One possibility is that these viewers were not fully engaged with the exchange between the characters, and their attention therefore drifted to other elements in the scene. Another is that these scenic elements drew particular interest because of their pattern, colour and so on. Further testing is needed to tease out whether this response is scene-dependent or a characteristic of these particular observers.

Figure 9: A slightly different patterning – 4 subjects – and wider viewing

It should be noted, nonetheless, that these observations come from only a very small sample (13 participants to date), which will be increased, and that the data still need to undergo further analysis and interpretation.

In summary, what have we seen in these results so far? Evidence of the eyes being held to attention by narrative cues, camera and character movement, faces, dialogue, point of view and performance. These elements were to be expected and add to the growing body of eye tracking evidence that supports much of current film and television theory, particularly the work of those in the cognitive tradition such as David Bordwell (2007) and Noël Carroll (1996). The results equally support those of other studies into narrative-centred visual texts (see Batty et al., forthcoming, 2015) and into how sound and movement affect gaze patterns (Rassell et al., forthcoming, 2015). Further, they speak to the way viewers are pulled seamlessly into the diegetic worlds they believe and invest in.

In this article we would now like to apply two different theoretical filters to the results just summarised: the first is an examination of the gaze, by Sean; the second is an examination of the physiological and perceptual processes of the eye in relation to movement, by Kim. Both filters attempt to make deeper sense of the results from the traditions in which the scholars operate. Following this analysis, a summary conclusion will draw their approaches together to make further interdisciplinary sense of Sherlock’s eyes.

Sherlock’s Gaze

Sean Redmond

The concept of the gaze has a long and contentious history in Film Studies if much less so in the study of television. In fact, John Ellis has suggested that the domestic context in which television viewing has historically taken place, with a host of likely distractions, and in a context of constant programme flow and segmentation, produces a glance aesthetic whereby the image isn’t looked into deeply or for a sustained period of time (1982: 138). Sherlock, of course, contests this idea since the programme is heavily built around the details of forensic gazing.

Most notably, the idea of the gaze has been employed in psychoanalytical film theory to argue that the cinematic looking apparatus is patriarchal and heterosexual, and that viewers are positioned as ‘male’ subjects through which masculine identifications emerge (Mulvey, 1975, 1989). In Sherlock, the main male characters, and the male writers and directors, of course control the vision regime, although its objects of focus are rarely to-be-looked-at female characters.

Critical race theory, by contrast, has employed the concept of the gaze to demonstrate how the racial Other is fixed in inferior, marginal and fetishized subject positions (Hall, 2001). Sherlock can be read as a post-colonial text enacting a present England that centres whiteness and ‘invisibly’ marginalizes the Other from its panopticon-empowered centre (Cuningham, 1994). The Other in Sherlock of course extends to those who sit outside the bourgeois social centre; there are particular class dimensions to the way crime is surveyed and defined (Jann, 1990).

In terms of surveillance discourse, film has been read as a vision machine set within an invasive visual culture that promotes:

the normalizing gaze, a surveillance that makes it possible to qualify, to classify and to punish. It establishes over individuals a visibility through which one differentiates and judges them (Foucault, 1977: 25).

Sherlock can be read as a text that carries out this normalizing gaze, defining the parameters of law and order and the way the criminal can be discovered, classified, and ultimately disciplined. That is not to say that the visual excesses of the programme do not at times undermine its simple binaries. To the contrary, Sherlock is constantly troubled by its own dominant discourses particularly through the way Sherlock is also a maverick outsider.

Finally, film phenomenology has made use of the gaze to demonstrate how looking and seeing are always embodied, experiential, and, depending on the text, haptic and synaesthetic, where ‘the eyes themselves function like organs of touch’ (Marks, 2000: 162). Sherlock creates the conditions of both embodied presence and haptic visuality through the way the gaze is employed to see deeply into things, while the programme’s textural mise en scène ‘demands’ to be attended to.

What I would now like to do is analyse two particular aspects of the way the gaze can be understood in Sherlock, relating my reading back to the eye tracking results that we have, and to eye tracking technology itself. First, I will explore Sherlock through its forensic gazing and the way this creates the particular conditions under which viewers become locked into particular viewing patterns and relations. Second, I will explore Sherlock through its haptic elements, whereby the viewer is understood to gaze at and touch (things) simultaneously.

The Forensic Gaze

In Sherlock one can argue that camera movement and position are motivated by the following factors. First, to reveal narrative information: a new location or setting; character relations and their relative physical proximity; time and temporal detail; and moments of revelation where a new angle or focus reveals something previously hidden or a new ‘enigma’ emerges. Second, as a dramatic device: the camera is re-positioned to signal and cue moments of narrative development, crisis, reaction, and activation. Third, there are repeated and recognised televisual conventions of the programme: one can locate and expect certain camera motifs to function in Sherlock, such as the way we enter Sherlock’s mind’s eye to see what he is unearthing in microscopic close-up. Finally, camera movement and position signal certain emotional states and modes of feeling. The cut to a close-up, for example, signals a moment of affecting intensity, as is the case with the fingers being scraped along the wall in the scene under analysis in this article.

When one takes these Sherlockian codes and conventions into consideration, one can make better sense of the eye tracking results we have obtained. The eyes of the viewer seem to be relentlessly led and directed. Viewers familiar with the programme’s codes can be expected to have expectations of its visual tapestry, and to make predictions about where to look (see Rassell et al., forthcoming). This would explain both the way that viewers seem closely aligned with the looking operations of the scene (Figures 1-7) and the way that viewers scan shots for narrative information (Figures 8-10).

However, I also think there is something more telling to discover here, concerning a consistent forensic looking regime in which the text and the viewer align. This is the ‘double vision’ referred to in the introduction to this article. Viewers come to embody the gazing powers that Sherlock possesses and look at the diegetic world through his eyes even where no direct or imagined point of view is in operation. Viewers experience their very own form of social surveillance, becoming detectives and snoopers in the process. Sherlock, then, can be read as a text of and for paranoid surveillance, fuelled by the constant search for facts, omissions, falsehoods and half-truths. At a more general cultural level, trust is at issue here, in what is perceived to be an age of ‘faithless’ activity and widespread corruption, where politicians are regarded as being as corrupt as the criminals they covertly support. Sherlockians ultimately become part of this age of conspiracy (Knight, 2002).

As does, in a very real sense, eye tracking technology and the data it produces. Sherlock is his very own eye tracker: he creates his own heat maps and relays, and through this inbuilt biotechnology he sees into everything. Eye tracking technology is Sherlockian, and the data it produces allows us to see into everything the viewer sees. Or, at least, mostly…

The Haptic Gaze

Laura U. Marks (2000) has written that haptic visuality is a more intimate form of looking, where the eyes ‘move over the surface of its object rather than plunge into illusionist depth, not to distinguish form so much as to discern texture’ (162). For Marks, film and video may be ‘thought of as impressionable and conductive, like skin’ (2000: xi-xii), and this sensory materiality is heightened by it containing:

Grainy, unclear images; sensuous imagery that evokes memory of the senses (i.e. water, nature); the depiction of characters in acute states of sensory activity (smelling, sniffing, tasting, etc.); close-to-the-body camera positions and panning across the surface of objects; changes in focus, under- and overexposure, decaying film and video imagery; optical printing; scratching on the emulsion; densely textured effects and formats such as Pixelvision… and alternating between film/video. (Totaro, 2002)

The gaze found in Sherlock is very often a haptic one. The programme’s entire mise en scène evokes the activity and memory of sensation. Lights, objects, clothes, furniture and exteriors are given deep and layered textures. Sherlockian environments are populated with objects and qualities that are themselves sensory driven (poison, oil, tactile fabrics, beads of sweat, cigarette smoke, wet soil). The camera very often dwells on these, picks them out, and tracks and pans over them in close and proximate detail. Sherlock of course is a master of haptic visuality: his eyes touch the things that he observes or that he conjures up in his imagination. In many respects, then, the viewer is also invited to see Sherlock through a haptic lens.

If one were to return to the eye tracking results on the scene, what we might be observing is not just an alignment in vision, and the search for narrative information, but eyes that have been turned into organs of touch and deep sensual appreciation. For example, in Figures 1 and 2 viewers are not just following the fingers that scrape along the wall but touching (with) them, and in touching them feeling them, as if it were their own fingers suffering this pain. In Figures 5-7 viewers are not just following the relay of looks between the two characters but ‘touching’ faces, eyes and mouths. In Figures 8-10, viewers are not just searching for narrative information and clues but are actively seeing into the textures, lights, objects and items that populate those scenes. The heat maps that eye tracking technology can generate may be more apt than we imagine, since the suggestion of temperature, of body-heat, may well give truth to the embodied and carnal nature of vision. This remains one of the limitations of eye tracking technology, however: it cannot tell us what people are feeling when watching a film or television text.

I would like to make one final observation about the eye that searches the mise en scène for narrative information or clues, as in Figures 8-10. This is a point about the privacy or individualism of watching a screen text, co-dependent as it is on personal memory, biography, and the contexts in which one finds oneself viewing something. After viewing the scene, one of our subjects remarked that they had actually spent much of the time trying to figure out who the actor playing Dr Watson was. Any number of ‘personal’ factors might get in the way of the looking regime of the text, and explain why we might scan a particular text.

Roland Barthes (1981) has usefully applied the concept of the punctum (a Latin word derived from the Greek word for trauma) to viewing photographs. He argues that the still image inspires an intensely private meaning, one in which an affecting ‘partial object’ emerges from its centre to ‘prick’ or ‘wound’ the viewer. The punctum is personal, and as soon as it emerges it holds the viewer’s gaze. Although Barthes is writing specifically about the photograph, I think the idea of the punctum can be applied to the moving image text, to Sherlock. Although a dynamic medium, television and film still settle on images and representations that reach out into the private realm of the viewer; and the viewer still finds their memories, traumas and life events activated and mobilised in the fictive worlds constructed. The wandering eyes of Figures 8-10 are caught in their own biographical exploration, their haptic eyes looking for objects that may ultimately wound them. In Sherlock we are not just being positioned as objects and subjects of surveillance but as carnal beings.

The Eyes of Sherlock; the Eyes of the Viewer

Kim Vincs

Coming, as I do, from the perspective of a dancer and choreographer, I read these eye tracking results not exactly as touch, but as a search for relationality. Erin Manning, in her seminal work on the philosophy of movement (Manning, 2009) emphasizes the incipiency of movement—movement as something in the act of becoming something, of reaching towards something or someone—over its positionality. That is to say, movement, in Manning’s terms, is always a process of relationality or reaching towards the world, rather than simply a series of coordinates on a grid. Bodies, or, more precisely “bodies-in-the-making,” are a means of thought rather than simply of action because they define, by the possibilities embedded in the moment of pre-articulation, a relationship or set of relationships within the world (Manning, 2009: 78).

Our eye movement data defines very clearly the relationships between the protagonists. In figures 1 and 2, ‘finger drag,’ the fixations map the pathway of Mrs. Hudson’s fingers along the wall. The relationships between Mrs. Hudson’s fingers and the texture of the wall are, in fact, the only potential human relationships within the scene. These relationships are not ‘positional’— that is to say, fingers defining points on the wall—but, by virtue of their spatial distribution, an articulation of the trajectory of Mrs. Hudson’s hand in relation to the wall.

Figures 3-6 reveal fixation patterns that are concentrated around the face, and in particular the eyes. These fixations address the origin of relationality in the scenes within the eyes, as if the eyes are understood to reveal the incipient thought of the characters. In Manning’s terms, incipiency, the moment in which a movement is in a state of ‘pre-acceleration’, organizing and mobilizing itself yet still capable of any number of actual outcomes, is the most potent aspect of a movement. The predominance of fixations on the eyes and face, while perfectly interpretable as simply a biological reflex designed to respond to and recognize human faces, also speaks of the process by which relationality is thought into being.

Figures 7-10, with their fixations distributed between human-to-human gaze and gesture (Figure 7), non-human scenic elements that lend cognitive elements or ‘clues’ (Figure 8), and poetic elements such as the look and texture of surrounding objects in Figures 8 (curtain texture) and 10 (lens flare), articulate an expanded notion of relationality in which human and non-human elements are implicated within a web of actions.

For me, these results suggest a mutability between human and non-human elements that is reminiscent of Manning’s understanding of relationality as something that can have both interpersonal and person-world dimensions, and also points to the kind of ‘Sherlockian body’ I seek as a dance artist. A purely ‘narrative’ approach to the filmic bodies analysed here would suggest that only human factors would feature in the gaze analysis. Similarly, a biologically driven notion of human movement perception as pre-wired to detect human shapes and actions would not predict migrations to and from the bodies to surrounding objects.

I read these results as revealing a semantic ambiguity that can form the basis for a Sherlockian search for relationality, one that is produced and constructed by the viewer as much as it is dictated by the film-maker. These scenes offer relatively few visual details from which to construct such a relationality, and this, no doubt, is indicative of a film-making style designed to direct the eye, and hence the mind, to very specifically and deliberately arranged narrative events. However, despite the seeming prescriptiveness of these images, they offer the viewer an opportunity to construct relational scenarios across like and non-like (human and non-human) scenic elements. They therefore demonstrate at least the potential for a Sherlockian approach to the visual perception of movement.


So, what have we seen with your eyes? On the one hand, we have demonstrated how Sherlock’s narrative and mise en scène pull the viewer into taking certain (emotive, forensic, relational) viewing positions. Sherlock is tightly bound by a number of codes and conventions, and its palette and composition are highly constructed, creating spaces and interactions that focus our eyes. On the other hand, we have looked at the way haptics and relationality open up the possibility of better understanding the synaesthetic and organic/inorganic ways through which the vision of movement takes place. A televisual text such as Sherlock intends to marshal our viewing experience but, as we have also seen, eyes wander; they search, their own movement and the poetics of movement opening up textual encounters not always pre-determined by the scene’s deliberate operations. Our eyes escape themselves. Finally, we have noted how memory and biography can impact on where and why we look, or why we might look away. One of our respondents searched Watson’s face in the hope of remembering the actor’s name. The brutal violence meted out to Mrs. Hudson may be looked at (felt) differently by someone who has themselves undergone such misery. This is ultimately about our being-in-the-world, and about how we make sense of beings-in-the-fictive-world. Vision is never disembodied but full of the drink, food and love of life itself.

Vision, we contend, can only be fully understood by combining different theoretical and methodological frameworks. What we have found in this article, and in the broader work of the Eye Tracking the Moving Image Research group – see the introduction to this special edition for more on this group – is that it is in the conversations and deliberations, provocations and discussions between vision scientists, neuroscientists, anatomists, choreographers, film makers, ethnographers, screen theorists and screen writers that insights are best made, conclusions thickened, and arguments enriched and extended. Sherlock exists at the eye of the ArtsScience nexus, and it is at this nexus that the authors would like to situate their work.

The language of film and television constantly creates these spaces of vision and for seeing to take place, whether this be the embodied point-of-view shot which allows us to become the character; aerial cinematography that brings wide-open exteriors and bejeweled cityscapes into view; the furtive camera that glimpses into dark corners, allowing us to happen onto what is supposedly hidden; or the interiorized gaze that expressionistically captures the nightmare visions of the lost, the hunted, and the alien. With a close-up shot one can trace the undulating valleys of emotion on a character’s face and feel their affecting eyes reaching out and into yours. In Sherlock, the very force of his vision, mobilized through special effects and the power of digital photography, enables his/our eyes to create mathematical formulations out of thin air and to re-visit crime scenes as if one were witnessing and experiencing it all over again. Sherlock thus establishes the sense that all vision is embodied and personally engineered, and speaks to the wider conceit that television is an artform that gives a miraculous, omnipotent and omnipresent vision to its ever watchful viewers.



Barthes, Roland. 1981. Camera lucida: Reflections on photography. Macmillan.

Batty, Craig, Dyer, Adrian, Perkins, Claire and Sita, Jodi. 2015. “Seeing Animated Worlds: Eye Tracking and the Spectator’s Experience of Narrative.” In Making Sense of Cinema, edited by CarrieLynn D. Reinhard and Christopher Olson, New York: Bloomsbury, forthcoming.

Birmingham, Elina, Bischof, Walter F., and Kingstone, Alan. 2008. “Gaze selection in complex social scenes.” Visual Cognition 16 (2–3): 341–355.

Bordwell, David. 2007. Poetics of Cinema. London: Routledge.

Carroll, Noël. 1996. Theorizing the Moving Image. Cambridge: Cambridge University Press.

Cuningham, Henry. 1994. “Sherlock Holmes and the Case of Race”. The Journal of Popular Culture, 28: 113–125. doi: 10.1111/j.0022-3840.1994.2802_113.x

Ellis, John. 1982. Visible Fictions, London: Routledge.

Ekman, Paul and Friesen, Wallace, V. 1971. “Constants across cultures in the face and emotion.” Journal of Personality and Social Psychology Vol 17 (2): 124-129.

Foucault, Michel. 1977. Discipline and Punish: the Birth of the Prison, New York: Random House.

Itti, Laurent and Koch, Christof. 2000. “A saliency-based search mechanism for overt and covert shifts of visual attention.” Vision Research 40 (10–12): 1489–1506.

Jann, Rosemary. 1990. “Sherlock Holmes codes the social body”. ELH: 685-708.

Knight, Peter. (Ed.). 2002. Conspiracy nation: The politics of paranoia in postwar America. New York: New York University Press.

Kuhn, Gustav, Tatler, Ben, and Cole, Geoff. 2009. “You look where I look! Effect of gaze cues on overt and covert attention misdirection.” Visual Cognition 17 (6/7): 925-944.

Hagendoorn, Ivar. 2004. “Some speculative hypotheses about the nature and perception of dance and choreography”. Journal of consciousness studies, 11, 79 – 110.

Hall, Stuart. 2001. “The Spectacle of the Other”. In Discourse Theory and Practice: A Reader, Routledge: 324-344.

Hernandez, Nadia, Metzger, Aude, Magné, Rémy, Bonnet-Brilhault, Frédérique, Roux, Sylvie, Barthelemy, Catherine, and Martineau, Joëlle. 2009. “Exploration of core features of a human face by healthy and autistic adults analyzed by visual scanning.” Neuropsychologia 47(4): 1004-1012.

Loughland, Carmel, M., Williams, Leanne. M., and Gordon, Evian. 2002. “Visual scanpaths to positive and negative facial emotions in an outpatient schizophrenia sample.” Schizophrenia Research, 55: 159–170.

Manning, Erin. 2009. Relationscapes: Movement, Art, Philosophy. Kindle edition. Cambridge, Mass.: MIT Press.

Marks, Laura U. 2000. The Skin of the Film: Intercultural Cinema, Embodiment and the Senses, Duke University Press.

Metz, Christian. 1982. The imaginary signifier: Psychoanalysis and the cinema. Indiana University Press.

Mulvey, Laura. 1975. “Visual Pleasure and Narrative Cinema”. Screen, 16(3), 6-18.

Mulvey, Laura. 1989. Visual and Other Pleasures. London: Macmillan.

Neale, Steve. 1990. “Questions of Genre”. Screen 31(1): 45-65.

Rassell, Andrea, Redmond, Sean, Robinson, Jenny, Stadler, Jane, Verhagen, Darrin and Pink, Sarah. 2015. “Seeing, Sensing Sound: Eye Tracking Soundscapes in Saving Private Ryan and Monsters, Inc.”. In Making Sense of Cinema, edited by CarrieLynn D. Reinhard and Christopher Olson, New York: Bloomsbury, forthcoming.

Redmond, Sean and Sita, Jodi. 2013. “What eye tracking tells us about the way we watch films”, 5th December, The Conversation. Accessed 5th September 2014.

Smith, Tim, J. 2013. “Watching You Watch Movies: Using Eye Tracking to Inform Cognitive Film Theory.” In Psychocinematics: Exploring Cognition at the Movies, edited by Arthur P. Shimamura, Oxford University Press.

Smith, Tim. J. and Mital, Parag. K. 2011. “Watching the world go by: Attentional prioritization of social motion during dynamic scene viewing.” [conference abstract]. Journal of Vision, 11(11): 478.

Stevens, Catherine, Winskel, Heather, Howell, Claire, Vidal, Lyne, Latimer, Cyril, and Milne-Home, Josephine. 2010. “Perceiving Dance: Schematic Expectations Guide Experts’ Scanning of a Contemporary Dance Film.” Journal of Dance Medicine & Science, 14(1): 19–25.

Torralba, Antonio, Oliva, Aude., Castelhano, Monica. S., Henderson, John, M. 2006. “Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search.” Psychological Review, 113 (4): 766–786.

Totaro, Donato. 2002. “Deleuzian Film Analysis: The Skin of the Film”. Off-Screen,  (accessed August 1st, 2011).

Treuting, Jennifer. 2006. “Eye tracking and cinema: A study of film theory and visual perception.” Society of Motion Picture and Television Engineers, 115(1): 31-40.

Vincent, Benjamin.T., Baddeley, Roland, Correani, Alessia, Troscianko, Tom, and Leonards, Ute. 2009. “Do we look at lights? Using mixture modelling to distinguish between low- and high-level factors in natural image viewing”. Visual Cognition, 17 (6–7), 856–879.

Vincs, Kim, and Barbour, Kim. 2014. “Snapshots of Complexity: using motion capture and principal component analysis to reconceptualise dance”. Digital Creativity, 25(1) 62-78.

Vincs, Kim, Schubert, Emery, and Stevens, Catherine. 2009. “Measuring responses to dance: is there a ‘grammar’ of dance?” Proceedings of the World Dance Alliance Global Summit, Brisbane, July 14-16, 2008.

Watts, Amber, J., and Douglas, Jacinta, M. 2006. “Interpreting facial expression and communication competence following severe traumatic brain injury”. Aphasiology, 20, 707–722.

Williams, Leanne M., Loughland, Carmel M., Gordon, Evian, and Davidson, Dean. 1999. “Visual scanpaths in schizophrenia: Is there a deficit in face recognition?” Schizophrenia Research, 40: 189–199.



Sean Redmond is an Associate Professor in Media and Communication at Deakin University. He has research interests in film and television aesthetics, film and television genre, film authorship, film sound, stardom and celebrity, and film phenomenology. He has published nine books including The Cinema of Takeshi Kitano: Flowering Blood (Columbia, 2013), and Celebrity and the Media (Palgrave, 2014), and with Su Holmes he edits the journal Celebrity Studies. Sean Redmond and Jodi Sita set up the Eye Tracking the Moving Image Research group in 2011.

Jodi Sita is an academic working within the areas of neuroscience and anatomy with expertise in eye tracking research. She has extensive experience with multiple project types using eye tracking technologies and other biophysical data. Her current research uses eye tracking to study viewers’ gaze patterns while watching moving images; to examine expertise in Australian Rules Football League coaches and players; and to examine the signature forgery process.

Professor Kim Vincs is Director and founder of Deakin Motion.Lab, at Deakin University. Kim integrates scientific and artistic approaches through research. She is currently working on a three-year project, supported by the Australian Research Council’s Discovery program. Her collaborations with mathematicians, biomechanists and cognitive psychologists span Deakin and the Universities of Sydney, Western Sydney and New South Wales.

From Subtitles to SMS: Eye Tracking, Texting and Sherlock – Tessa Dwyer


As we progress into the digital age, text is experiencing a resurgence and reshaping as blogging, tweeting and phone messaging establish new textual forms and frameworks. At the same time, an intrusive layer of text, obviously added in post, has started to feature on mainstream screen media – from the running subtitles of TV news broadcasts to the creative portrayals of mobile phone texting in film and TV dramas. In this paper, I examine the free-floating text used in the BBC series Sherlock (2010–). While commentators laud this series for the novel way it integrates text into its narrative, aesthetic and characterisation, eye tracking is required to unpack the cognitive implications involved. Through recourse to eye tracking data on image and textual processing, I revisit distinctions between reading and viewing, attraction and distraction, while addressing a range of issues relating to eye bias, media access and multimodal redundancy effects.

Figure 1

Figure 1: Press conference in ‘A Study in Pink’, Sherlock (2010), Episode 1, Season 1.


BBC’s Sherlock (2010–) has received considerable acclaim for its creative deployment of text to convey thought processes and, most notably, to depict mobile phone messaging. Receiving high-profile write-ups in The Wall Street Journal (Dodes, 2013) and Wired UK, this innovative representational strategy has been hailed as an incisive reflection of our current “transhuman” reality and “a core element of the series’ identity” (McMillan, 2014).[1] In the following discussion, I deploy eye tracking data to develop an alternate perspective on this phenomenon. While Sherlock’s on-screen text directly engages with the emerging modalities of digital and online technologies, it also borrows from more conventional textual tools like subtitling and captioning or SDH (subtitling for the deaf and hard-of-hearing). Most emphatically, the presence of floating text in Sherlock challenges the presumption that screen media is made to be viewed, not read. To explore this challenge in detail, I bring Sherlock’s inventive titling into contact with eye tracking research on subtitle processing, using insights from audiovisual translation (AVT) studies to investigate the complexities involved in processing dynamic text on moving-image screens. Bridging screen and translation studies via eye tracking, I consider recent on-screen text developments in relation to issues of media access and linguistic diversity, noting the gaps or blind spots that regularly infiltrate research frameworks. Discussion focuses on ‘A Study in Pink’ – the first episode of Sherlock’s initial season – which producer Sue Vertue explains was actually “written and shot last, and so could make the best use of onscreen text as additional script and plot points” (qtd in McMillan, 2014).

Texting Sherlock

Figure 2

Figure 2: Watson reads a text message in ‘A Study in Pink’, Sherlock (2010), Episode 1, Season 1.

The phenomenon under investigation in this article is by no means easy to define. Already it has inspired neologisms, word mashes and acronyms including TELOP (television optical projection), ‘impact captioning’ (Sasamoto, 2014), ‘decotitles’ (Kofoed, 2011), ‘beyond screen text messaging’ (Zhang, 2014) and ‘authorial titling’ (Pérez González, 2012). While slight differences in meaning separate such terms from one another, the on-screen text in Sherlock fits them all. Hence, in this discussion, I alternate between them and often default to more general terms like ‘titling’ and ‘on-screen text’ for their wide applicability across viewing devices and subject matter. This approach preserves the terminological ambiguity that attaches to this phenomenon instead of seeking to solve it, finding it symptomatic of the rapid rate of technological development with which it engages. Whatever term is decided upon today could well be obsolete tomorrow. Additionally, as Rick Altman (2004: 16) notes in his ‘crisis historiography’ of silent and early sound film, the “apparently innocuous process of naming is actually one of culture’s most powerful forms of appropriation.” He argues that in the context of new technologies and the representational codes they engender, terminological variance and confusion signal an identity crisis “reflected in every aspect of the new technology’s socially defined existence” (19).

According to the write-ups, phone messaging is the hero of BBC’s updated and rebooted Sherlock adaptation. Almost all the press garnered around Sherlock’s on-screen text links this strategy to mobile phone ‘texting’ or SMS (short messaging service). Reporting on “the storytelling challenges of a world filled with unglamorous smartphones, texting and social media”, The Wall Street Journal’s Rachel Dodes (2013) credits Sherlock with solving this dilemma and establishing a new convention for depicting texting on the big screen, creatively capturing “the real world’s digital transformation of everyday life.” For Mariel Calloway (2013), “Sherlock is honest about the role of technology and social media in daily life and daily thought… the seamless way that text messages and internet searches integrate into our lives.” Wired’s Graeme McMillan (2014) ups the ante, naming Sherlock a “new take” on “television drama as a whole” due precisely to its on-screen texting technique that sets it apart from other “tech-savvy shows out there”. McMillan continues that, “as with so many aspects of Sherlock, there’s an element of misdirection going on here, with the fun, eye-catching slickness of the visualization distracting from a deeper commentary the show is making about its characters relationship with technology – and, by extension, our own relationship with it, as well.”

As this flurry of media attention makes clear, praise for Sherlock’s on-screen text or texting firmly anchors this strategy to technology and its newly evolving forms, most notably the iPhone or smartphone. Appearing consistently throughout the series’ three seasons to date, on-screen text in Sherlock occurs in a plain, uniform white sans-serif font that appears unadorned over the screen image, obviously added during post-production. This text is superimposed, pure and simple, relying on neither text bubbles nor coloured boxes nor sender IDs to formally separate it from the rest of the image area. As Michele Tepper (2011) eloquently notes, by utilising text in this way, Sherlock “is capturing the viewer’s screen as part of the narrative itself”:

It’s a remarkably elegant solution from director Paul McGuigan. And it works because we, the viewing audience, have been trained to understand it by the last several years of service-driven, multi-platform, multi-screen applications. Last week’s iCloud announcement is just the latest iteration of what can happen when your data is in the cloud and can be accessed by a wide range of smart-enough devices. Your VOIP phone can show caller ID on your TV; your iPod can talk to both your car and your sneakers; Twitter is equally accessible via SMS or a desktop application. It doesn’t matter where or what the screen is, as long as it’s connected to a network device. … In this technological environment, the visual conceit that Sherlock’s text message could migrate from John Watson’s screen to ours makes complete and utter sense.

Unlike on-screen text in Glee (Fox, 2009–), for instance (see Fig. 3), which is used only occasionally in episodes like ‘Feud’ (Season 4, Ep 16, March 14, 2013), Sherlock flaunts its on-screen text as signature. Its consistently interesting textual play helps to give the series cohesion. Yet, just as it aids in characterisation, helps to progress the narrative, and binds the series as a whole, it also, necessarily, remains somewhat at a remove, as an overtly post-production effect.

Figure 3

Figure 3: Ryder chats online in ‘Feud’, Glee (2013), Episode 16, Season 4.

While Tepper (2011) explains how Sherlock’s “disembodied” (Banks, 2014) texting ‘makes sense’ in the age of cross-platform devices and online clouds, this argument falters when the on-screen text in question is less overtly technological. The extradiegetic nature of this on-screen text – so obviously a ‘post’ effect – is brought to the fore when it is used to render thoughts and emotions rather than technological interfacing. In ‘A Study in Pink’, a large proportion of the text that pops up intermittently on-screen functions to represent Sherlock’s interiority, not his Internet prowess. In concert with camera angles and “microscopic close-ups”, it elucidates Sherlock’s forensic “mind’s eye” (Redmond, Sita and Vincs, this issue), highlighting clues and literally spelling out their significance (see Figs. 4 and 5). The fact that these human-coded moments of titling have received far less attention in the press than those that more directly index new technologies is fascinating in itself, revealing the degree to which praise for Sherlock’s on-screen text is invested in ideas of newness and technological innovation – underlined by the predilection for neologisms.

Figure 4

Figure 4: Sherlock examines the pink lady’s ring in ‘A Study in Pink’, Sherlock (2010), Episode 1, Season 1.

Figure 5

Figure 5: Sherlock examines the pink lady’s ring in ‘A Study in Pink’, Sherlock (2010), Episode 1, Season 1.

Of course, even when not attached to smartphones or data retrieval, Sherlock’s deployment of on-screen text remains fresh, creative and playful and still signals perceptual shifts resulting from technological transformation. Even when representing Sherlock’s thoughts, text flashes on screen manage to recall the excesses of the digital, when email, Facebook and Twitter ensconce us in streams of endlessly circulating words, and textual pop-ups are ubiquitous. Nevertheless, the blinkered way in which Sherlock’s on-screen text is repeatedly framed as, above all, a means of representing mobile phone texting functions to conceal some of its links to older, more conventional forms of titling and textual intervention, from silent-era intertitles to expository titles to subtitles. By relentlessly emphasising its newness, much discussion of Sherlock’s on-screen text overlooks links to a host of related past and present practices. Moreover, Sherlock’s textual play actually invites a rethinking of these older, ongoing text-on-screen devices.

Reading, Watching, Listening

As Szarkowska and Kruger (this issue) explain, research into subtitle processing builds upon earlier eye tracking studies on the reading of static, printed text. They proceed to detail differences between subtitle and ‘regular’ reading, in relation to factors like presentation speed, information redundancy, and sensory competition between different multimodal channels. Here, I focus on differences between saccadic or scanning movements and fixations, in order to compare data across the screen and translation fields. During ‘regular’ reading (of static texts) average saccades last 20 to 50 milliseconds (ms) while fixations range between 100 and 500ms, averaging 200 to 300ms (Rayner, 1998). Referencing pioneering studies into subtitle processing by Géry d’Ydewalle and associates, Szarkowska et al. (2013: 155) note that “when reading film subtitles, as opposed to print, viewers tend to make more regressions” and fixations tend to be shorter. Regressions occur when the eye returns to material that has already been read, and Rayner (1998: 393) finds that slower readers (of static text) make more regressions than faster readers. A study by d’Ydewalle and de Bruycker (2007: 202) found “the percentage of regressions in reading subtitles was globally, among children and adults, much higher than in normal text reading.” They also report that mean fixation durations in the subtitles were shorter, at 178ms (for adults), and attribute subtitle regressions in part to the “considerable information redundancy” that occurs when “[s]ubtitle, soundtrack (including the voice and additional information such as intonation, background noise, etc.), and image all provide partially overlapping information, eliciting back and forth shifts with the image and more regressive eye-movements” (202).

What happens to saccades and fixations when image processing is brought into the mix? When looking at static images, average fixations last 330 ms (Rayner, 1998). This figure is slightly longer than average fixations during regular reading and longer again than average subtitle fixations. Szarkowska and Kruger (this issue) note that “reading requires many successive fixations to extract information whereas looking at a scene requires fewer, but longer fixations” that tend to be more exploratory or ambient in nature, taking in a greater area of focus. In relation to moving-images, Smith (2013: 168) finds that viewers take in roughly 3.8% of the total screen area during an average length shot. Peripheral processing is at play but “is mostly reserved for selecting future saccade targets, tracking moving targets, and extracting gist about scene category, layout and vague object information”. In thinking about these differences in regular reading behaviour, screen viewing, and subtitle processing, it is noticeable that with subtitles, distinctions between fixations and saccades are less clear-cut. While saccades last between 20 and 50ms, Smith (2013: 169) notes that the smallest amount of time taken to perform a saccadic eye movement (taking into account saccadic reaction time) is 100-130ms. Recalling d’Ydewalle and de Bruycker’s (2007: 202) finding that fixations during subtitle processing last around 178ms, it would seem that subtitle conditions blur the boundaries somewhat between saccades and fixations, scanning and reading.

Interestingly, studies have also shown that the processing of two-line subtitles involves more regular word-by-word reading than for one-liners (D’Ydewalle and de Bruycker, 2007: 199). D’Ydewalle and de Bruycker (2007: 199) report, for instance, that more words are skipped and more regressions occur for one-line subtitles than for two-line subtitles. Two-line subtitles result in a larger proportion of time being spent in the subtitle area, and occasion more back-and-forth shifts between the subtitles and the remaining image area (201). This finding suggests that the processing of one-line subtitles differs considerably from regular reading behaviour. D’Ydewalle and de Bruycker (2007: 202) surmise that the distinct way in which one-line subtitles are processed relates to a redundancy effect caused by the multimodal nature of screen media. Noting how one-line subtitles often convey short exclamations and outcries, they suggest that a “standard one-line subtitle generally does not provide much more information than what can already be extracted from the picture and the auditory message.” They conclude that one-line subtitles occasion “less reading” than two-line subtitles (202). Extrapolating further, I posit that the routine overlapping of information that occurs in subtitled screen media blurs lines between reading and watching. One-line subtitles are ‘read’ irregularly and partly blind – that is, they are regularly skipped and processed through saccadic eye movements rather than fixations.

This suggestion is supported by data on subtitle skipping. Szarkowska and Kruger (this issue) find that longer subtitles containing frequently used words are easier and quicker to process than shorter subtitles containing low-frequency words. Hence, they conclude that cognitive load relates more to word familiarity than quantity, something that is overlooked in many professional subtitling guidelines. This finding indicates that high-frequency words are processed ‘differently’ in subtitling than in static text, in a manner more akin to visual recognition or scanning than reading. Szarkowska and Kruger find that high-frequency words in subtitles are often skipped. Hence, as with one-line subtitles, high-frequency words are, to a degree, processed blind, possibly through shape recognition and mapping more than durational focus. In relation to other types of on-screen text, such as the short, free-floating type that characterises Sherlock, it seems entirely possible that this innovative mode of titling may just challenge distinctions between text and image processing. While commentators laud this series for the way it integrates on-screen text into its narrative, style and characterisation, eye tracking is required to unpack the cognitive implications of Sherlock’s text/image morph.

The Pink Lady

Figure 6

Figure 6: Letters scratched into the floor in ‘A Study in Pink’, Sherlock (2010), Episode 1, Season 1.

Sherlock producer Vertue refers to the pink lady scene in ‘A Study in Pink’ as particularly noteworthy for its “text all around the screen”, referring to it as the “best use” of on-screen text in the series (qtd in McMillan, 2014). In this scene, a dead woman dressed in pink lies face first on the floor of a derelict building into which she has painstakingly etched a word or series of letters (‘Rache’) with her fingernails. As Sherlock investigates the crime scene, forensics officer Anderson interrupts to explain that ‘Rache’ is the German word for ‘revenge’. The German-to-English translation pops up on screen (see Fig. 6), and this time Sherlock sees it too. This superimposed text, so obviously laid over the image, oversteps its surface positioning to enter Sherlock’s diegetic space, and we next view it backwards, from Sherlock’s point of view, not ours (see Fig. 7). After an exasperated eye roll that signals his disregard for Anderson, Sherlock dismisses this textual intervention and we watch it swirl into oblivion. Here, on-screen text is at once both inside and outside the narrative, diegetic and extra-diegetic, informative and affecting. In this way it self-reflexively draws attention to the show’s narrative framing, demonstrating its complexity as distinct diegetic levels merge.

Figure 7

Figure 7: Sherlock sees on-screen text in ‘A Study in Pink’, Sherlock (2010), Episode 1, Season 1.

For Carol O’Sullivan (2011), when on-screen text affords this type of play between the diegetic and extra-diegetic it functions as an “extreme anti-naturalistic device” (166) that she unpacks via Gérard Genette’s notion of narrative metalepsis (164). Detailing numerous examples of humorous, formally transgressive diegetic subtitles, such as those found in Annie Hall (Woody Allen, 1977) (Fig. 8), O’Sullivan points to their metatextual function, referring to them as “metasubtitles” (166) that implicitly comment on the limits and nature of subtitling itself. When Sherlock’s on-screen titles oscillate between character and viewer point-of-view shots, they too become metatextual, demonstrating, in Genette’s terms, “the importance of the boundary they tax their ingenuity to overstep in defiance of verisimilitude – a boundary that is precisely the narrating (or the performance) itself: a shifting but sacred frontier between two worlds, the world in which one tells, the world of which one tells” (qtd in O’Sullivan 2011: 165). Moreover, for O’Sullivan, “all subtitles are metatextual” (166), necessarily foregrounding their own act of mediation and interpretation. Specifically linking such ideas to Sherlock, Luis Pérez González (2012: 18) notes how “the series creators incorporate titles that draw attention to the material apparatus of filmic production”, thereby creating a complex alienation-attraction effect “that shapes audience engagement by commenting upon the diegetic action and disrupting conventional forms of semiotic representation, making viewers consciously work as co-creators of media content.”

Figure 8

Figure 8: Subtitled thoughts in the balcony scene, Annie Hall (1977).

Eye Bias

One finding from subtitle eye tracking research particularly pertinent to Sherlock is the notion that on-screen text causes eye bias. This was established in various studies conducted by d’Ydewalle and associates, which found that subtitle processing is largely automatic and obligatory. D’Ydewalle and de Bruycker (2007: 196) state:

Paying attention to the subtitle at its presentation onset is more or less obligatory and is unaffected by major contextual factors such as the availability of the soundtrack, knowledge of the foreign language in the soundtrack, and important episodic characteristics of actions in the movie: Switching attention from the visual image to “reading” the subtitles happens effortlessly and almost automatically (196).

This point is confirmed by Bisson et al. (2014: 399), who report that participants read subtitles even in ‘reversed’ conditions – that is, when subtitles are rendered in an unfamiliar language and the screen audio is fully comprehensible (in the viewers’ first language) (413). Again, in intralingual or same-language subtitling – when titles replicate the language spoken on screen – hearing audiences still divert to the subtitle area (413). These findings indicate that viewers track subtitles irrespective of language or accessibility requirements. In fact, the tracking of subtitles occurs regardless of their function. As Bisson et al. (413) surmise, “the dynamic nature of the subtitles, i.e., the appearance and disappearance of the subtitles on the screen, coupled with the fact that the subtitles contained words was enough to generate reading behavior”.

Szarkowska and Kruger (this issue) reach a similar conclusion, explaining eye bias towards subtitles in terms of both bottom-up and top-down impulses. When subtitles or other forms of text flash up on screen, they effect a change in the scene that automatically pulls our eyes. The appearance and disappearance of text on screen is registered in terms of motion contrast, which according to Smith (2013: 176), is the “critical component predicting gaze behavior”, attaching to small movements as well as large. Additionally, we are drawn to words on screen because we identify them as a ready source of relevant information, as found in Batty et al. (forthcoming). Analysing a dialogue-free montage sequence from the animated feature Up (Pete Docter, 2009), Batty et al. found that on-screen text in the form of signage replicates in miniature how ‘classical’ montage functions as a condensed form of storytelling aiming for enhanced communication and exposition. They suggest that montage offers a rhetorical amplification of an implicit intertitle, thereby alluding to the historical roots of text on screen while underlining its narrative as well as visual salience. One frame from the montage sequence focuses in close-up on a basket containing picnic items and airline tickets (see Fig. 9). Eye tracking tests conducted on twelve participants indicate a high degree of attentional synchrony in relation to the text elements of the airline ticket on which Ellie’s name is printed. Here, text provides a highly expedient visual clue as to the narrative significance of the scene and viewers are drawn to it precisely for its intertitle-like, expository function, highlighting the top-down impulse also at play in the eye bias caused by on-screen text.

Figure 9

Figure 9: Heat map showing collective gaze weightings during the montage sequence in Up (2009).

In this image from Up, printed text appears in the centre of the frame and, as Smith (2013: 178) elucidates, eyes are instinctively drawn towards frame centre, a finding backed up by much subtitle research (see Szarkowska and Kruger, this issue). However, eye tracking results on Sherlock conducted by Redmond, Sita and Vincs (this issue) indicate that viewers also scan static text when it is not in the centre of the frame. In an establishing shot of 221B Baker Street from the first episode of Sherlock’s second season, ‘A Scandal in Belgravia’, viewers track static text that borders the frame across its top and right-hand sides, again searching for information (see Fig. 10). Hence, the eye-pull exerted by text is noticeable even in the absence of movement, contrast and central framing. In part, viewers are attracted to text simply because it is text – identified as an efficient communication mode that facilitates speedy comprehension (see Lavaur, 2011: 457).

Figure 10

Figure 10: Single viewer gaze path for ‘A Scandal in Belgravia’, Sherlock (2012), Episode 1, Season 2.


What do these eye tracking results across screen and translation studies tell us about Sherlock’s innovative use of on-screen text and texting? Based on the notion that text on screen draws the eye in at least two ways, due to both its dynamic/contrastive nature and its communicative expediency, we can surmise that for Sherlock viewers, on-screen text is highly visible and more than likely to be in that 3.8% of the screen on which they will focus at any one point in time (see Smith, 2013: 168). The marked eye bias caused by text on screen is further accentuated in Sherlock by the freshness of its textual flashes, especially for English-speaking audiences given the language hierarchies of global screen media (see Acland 2012, UNESCO 2013). The small percentage of foreign-language media imported into most English-speaking markets tends to result in a lack of familiarity with subtitling beyond niche audience segments. For those unfamiliar with subtitling or captioning, on-screen text appears particularly novel. Additionally, as explored, floating TELOPs in Sherlock attract attention due to the complex functions they fulfil, providing narrative and character clues as well as textual and stylistic cohesion. As Tepper (2011) points out, in the first episode of the series, viewers are introduced to Sherlock’s character via text, before seeing him on screen. “When he texts the word ‘Wrong!’ to DI Lestrade and all the reporters at Lestrade’s press conference,” notes Tepper, “the technological savvy and the imperiousness of tone tell you most of what you need to know about the character.”

There seems no doubt that on-screen text in Sherlock attracts eye movement, and that it therefore distracts from other parts of the image. One question then that immediately presents itself is why Sherlock’s textual distractions are tolerated – even celebrated – to a far greater extent than other, more conventional or routine forms of titling like subtitles and captions. While Sherlock’s on-screen text is praised as innovative and incisive, interlingual subtitling and SDH are criticised by detractors for the way in which they supposedly force viewers to read rather than watch, effectively transforming film into “a kind of high-class comic book with sound effects” (Canby, 1983).[2] Certainly, differences in scale affect such attitudes and the quantitative variance between post-subtitles (produced for distribution only) and authorial or diegetic titling (as seen in Sherlock) is pronounced.[3] However, eye tracking research on subtitle processing indicates that, on the whole, viewers easily accommodate the increased cognitive load it presents. Although attentional splitting occurs, leading to an increase in back-and-forth shifts between the subtitles and the rest of the image area (Szarkowska and Kruger, this issue), viewers acclimatise by making shorter fixations than in regular reading and by skipping high-frequency words and subtitles while still managing to register meaning (see d’Ydewalle and de Bruycker, 2007: 199). In this way, subtitle processing reveals many differences from the reading of static text, and approximates techniques of visual scanning. Bearing these findings in mind, I propose it is more accurate to see subtitling as transforming reading into viewing and text into image, rather than vice versa.

Situating Sherlock in relation to a range of related TELOP practices across diverse TV genres (such as game shows, panel shows, news broadcasting and dramas), Ryoko Sasamoto (2014: 7) notes that the additional processing effort caused by on-screen text is offset by its editorial function.[4] TELOPs are often deployed by TV producers to guide interpretation and ensure comprehension by selecting and highlighting information deemed most relevant. This suggestion is backed up by research by Rei Matsukawa et al. (2009), which found that the information redundancy effect caused by TELOPs facilitates understanding of TV news. For Sasamoto (2014: 7), ‘impact captioning’ highlights salient information in much the same way as voice intonation or contrastive stress. It acts as a “written prop on screen” enabling “TV producers to achieve their communicative aims… in a highly economical manner” (8). Focusing on Sherlock specifically, Sasamoto suggests that its captioning provides “a route for viewers into complex narratives” (9). Moreover, as Szarkowska and Kruger (this issue) note, in static reading conditions, “longer fixations typically reflect higher cognitive load.” Consequently, the shorter fixations that characterise subtitle viewing support the contention that on-screen text processing is eased by its expedient, editorial function and by redundancy effects resulting from its multimodality.

Switched On

Another way in which Sherlock’s text and titling innovations extend beyond mobile phone usage was exemplified in July 2013 by a promotional campaign that promised viewers a ‘sneak peek’ at a yet-to-be-released episode title, requiring them to find and piece together a series of clues. In true Sherlockian style, the clues were well hidden, only visible to viewers if they switched on the closed-captioning or SDH available for deaf and hard-of-hearing audiences. With this device turned on, viewers encountered intralingual captioning along the bottom of their screen and, additionally, individually boxed letters that appeared top left (see Figs. 11 and 12). Viewers needed to gather all these single letter clues in order to deduce the episode title: ‘His Last Vow’. According to the ‘I Heart Subtitles’ blog (July 16, 2013), in doing so, Sherlock once again displayed its ability to “think outside the box and consider all…options”. It also cemented its commitment to on-screen text in various guises, and effectively gave voice to an audience segment typically disregarded in screen commentary and analysis. Through this highly unusual, cryptic campaign, Sherlock alerted viewers to more overtly functional forms of titling, and intimated points of connection between language, textual intervention and access.

Figure 11

Figure 11: Boxed letter clues (top left of frame) that appeared when closed captioning was switched on, during a re-run of ‘A Scandal in Belgravia’, Sherlock (2012), Episode 1, Season 2.

Figure 12

Figure 12: Boxed letter clues (top left of frame) that appeared when closed captioning was switched on, during a re-run of ‘A Scandal in Belgravia’, Sherlock (2012), Episode 1, Season 2.


On-screen text invites a rethinking of the visual, expanding its borders and blurring its definitional clarity. Eye tracking research demonstrates that moving text on screens is processed differently to static text, affected by a range of factors issuing from its multimodal complexity. Sherlock subtly signals such issues through its playful, irreverent deployment of text, which enables viewers to directly access Sherlock’s thoughts and understand his reasoning, while also distancing them, asking them to marvel at his ‘millennial’ technological prowess (Stein and Busse, 2012: 11) while remaining self-consciously aware of his complex narrative framing as it flips inside out, inviting audiences to watch themselves watching. Such diegetic transgression is yet to be mapped through eye tracking, intimating a profitable direction for future studies. To date, data on text and image processing demonstrates that on-screen text attracts eye movements and, it can be inferred, distracts from other parts of the image area. Yet, despite rendering more of the image effectively ‘invisible’, text in the form of TELOPs is increasingly prevalent in news broadcasts, current affairs panel shows (when audience text messages are displayed) and, most notably, in Asian TV genres, where it is now a “standard editorial prop” featured in many dramas and game shows (Sasamoto, 2014: 1). In order to take up the challenge presented by such emerging modes of screen address, research needs to move beyond surface assessments of the attraction/distraction nexus. It is the very attraction to TELOP distraction that Sherlock – via eye tracking – brings to the fore.



Acland, Charles. 2012. “From International Blockbusters to National Hits: Analysis of the 2010 UIS Survey on Feature Film Statistics.” UIS Information Bulletin 8: 1-24. UNESCO Institute for Statistics.

Altman, Rick. 2004. Silent Film Sound. New York: Columbia University Press.

Banks, David. 2012. “Sherlock: A Perspective on Technology and Story Telling.” Cyborgology, January 25. Accessed October 9, 2014.

Batty, Craig, Adrian Dyer, Claire Perkins and Jodi Sita (forthcoming). “Seeing Animated Worlds: Eye Tracking and the Spectator’s Experience of Narrative.” In Making Sense of Cinema: Empirical Studies into Film Spectators and Spectatorship, edited by Carrie Lynn D. Reinhard and Christopher J. Olson. London and New York: Bloomsbury.

Bennet, Alannah. 2014. “From Sherlock to House of Cards: Who’s Figured Out How to Translate Texting to Film.” Bustle, August 18. Entertainment. Accessed October 9.

Biedenharn, Isabella. 2014. “A Brief Visual History of On-Screen Text Messages in Movies and TV.” Flavorwire, April 24. Accessed October 13.

Bisson, Marie-Josée, Walter J. B. Van Heuven, Kathy Conklin and Richard J. Tunney. 2014. “Processing of native and foreign language subtitles in films: An eye tracking study.” Applied Psycholinguistics 35: 399–418. Accessed October 13, 2014. doi: 10.1017/S0142716412000434.

Calloway, Mariel. 2013. “The Game is On(line): BBC’s ‘Sherlock’ in the Age of Social Media.” Mariel Calloway, March 8. Accessed October 14, 2014.

Canby, Vincent. 1983. “A Rebel Lion Breaks Out.” New York Times, March 27, 21.

Dodes, Rachel. 2013. “From Talkies to Texties.” Wall Street Journal, April 4, Arts and Entertainment Section. Accessed October 13, 2014.

d’Ydewalle, Géry and Wim De Bruycker. 2007. “Eye movements of children and adults while reading television subtitles.” European Psychologist 12 (3): 196-205.

Kofoed, D. T. 2011. “Decotitles, the Animated Discourse of Fox’s Recent Anglophonic Internationalism.” Reconstruction 11 (1). Accessed October 5, 2012.

Lavaur, Jean-Marc and Dominic Bairstow. 2011. “Languages on the screen: Is film comprehension related to the viewers’ fluency level and to the language in the subtitles?” International Journal of Psychology 46 (6): 455-462. doi: 10.1080/00207594.2011.565343.

McMillan, Graeme. 2014. “Sherlock’s Text Messages Reveal Our Transhumanism.” Wired UK, February 3. Accessed October 14.

Matsukawa, Rei, Yosuke Miyata and Shuichi Ueda. 2009. “Information Redundancy Effect on Watching TV News: Analysis of Eye Tracking Data and Examination of the Contents.” Literary and Information Science 62: 193-205.

O’Sullivan, Carol. 2011. Translating Popular Film. Basingstoke and New York: Palgrave Macmillan.

Pérez González, Luis. 2013. “Co-Creational Subtitling in the Digital Media: Transformative and Authorial Practices.” International Journal of Cultural Studies 16 (1): 3-21. Accessed September 25, 2014. doi: 10.1177/1367877912459145.

Rayner, Keith. 1998. “Eye Movements in Reading and Information Processing: 20 Years of Research.” Psychological Bulletin 124: 372-422.

Redmond, Sean, Jodi Sita and Kim Vincs. 2015. “Our Sherlockian Eyes: The Surveillance of Vision.” Refractory: a Journal of Entertainment Media, 25.

Romero-Fresco, Pablo. 2013. “Accessible filmmaking: Joining the dots between audiovisual translation, accessibility and filmmaking.” JoSTrans: The Journal of Specialised Translation 20: 201-23. Accessed September 20, 2014.

Sasamoto, Ryoko. 2014. “Impact caption as a highlighting device: Attempts at viewer manipulation on TV.” Discourse, Context and Media 6: 1-10. Accessed September 18 (Article in Press). doi: 10.1016/j.dcm.2014.03.003.

Schrodt, Paul. 2013. “This is How to Shoot Text Messaging.” Esquire, February 4. The Culture Blog. Accessed October 13, 2014.

Smith, Tim J. 2013. “Watching You Watch Movies: Using Eye Tracking to Inform Cognitive Film Theory.” In Psychocinematics: Exploring Cognition at the Movies, edited by Arthur P. Shimamura, 165-91. Oxford and New York: Oxford University Press. Accessed October 7, 2014. doi:

Stein, Louise Ellen and Kristina Busse. 2012. “Introduction: The Literary, Televisual and Digital Adventures of the Beloved Detective.” In Sherlock and Transmedia Fandom: Essays on the BBC Series, edited by Louise Ellen Stein and Kristina Busse, 9-24. Jefferson: McFarland and Company.

Szarkowska, Agnieszka et al. 2013. “Harnessing the Potential of Eye-Tracking for Media Accessibility.” In Translation Studies and Eye-Tracking Analysis, edited by Sambor Grucza, Monika Płużyczka and Justyna Zając, 153-83. Frankfurt am Main: Peter Lang.

Szarkowska, Agnieszka and Jan Louis Kruger. 2015. “Subtitles on the Moving Image: An Overview of Eye Tracking Studies.” Refractory: a Journal of Entertainment Media, 25.

Tepper, Michele. 2011. “The Case of the Travelling Text Message.” Interactions Everywhere, June 14. Accessed October 14, 2014.

UNESCO. 2013. “Feature Film Diversity”, UIS Fact Sheet 24, May. Accessed October 3, 2014.

Zhang, Sarah. 2014. “How Hollywood Figured Out A Way To Make Texting In Movies Look Less Dumb.” Gizmodo, August 18. Accessed August 19.

Zhou, Tony. 2014. “A Brief Look at Texting and the Internet in Film.” Video Essay, Every Frame a Painting, August 15. Accessed August 19.


List of Figures




[1] While some commentators point out that Sherlock was by no means the first to depict text messaging in this way – as floating text on screen – it is this series more than any other that has brought the phenomenon into the limelight. Other notable uses of on-screen text to depict mobile phone messaging occur in the films All About Lily Chou-Chou (Iwai, 2001), Disconnect (Rubin, 2013), The Fault in Our Stars (Boone, 2014), LOL (Azuelos, 2012), Non-Stop (Collet-Serra, 2014), Wall Street: Money Never Sleeps (Stone, 2010), and in the TV series Glee (Fox, 2009–), House of Cards (Netflix, 2013–), Hollyoaks (Channel 4, 1995–), Married Single Other (ITV, 2010) and Slide (Fox8, 2011). For discussion of some ‘early adopters’, see Biedenharn 2014.



[2] Notably, in this New York Times piece, Canby (1983) actually defends subtitling against this charge, and advocates for subtitling over dubbing.

[3] On distinctions between post-subtitling and pre-subtitling (including diegetic subtitling), see O’Sullivan (2011).

[4] According to Sasamoto (2014: 1), “the use of OCT [Open Caption Telop] as an aid for enhanced viewing experience originated in Japan in 1990.”



Dr Tessa Dwyer teaches Screen Studies at the University of Melbourne, specialising in language politics and issues of screen translation. Her publications have appeared in journals such as The Velvet Light Trap, The Translator and The South Atlantic Quarterly and in a range of anthologies including B is for Bad Cinema (2014), Words, Images and Performances in Translation (2012) and the forthcoming Locating the Voice in Film (2016), Contemporary Publics (2016) and the Routledge Handbook of Audiovisual Translation (2017). In 2008, she co-edited a special issue of Refractory on split screens. She is a member of the ETMI research group and is currently writing a book on error and screen translation.

Subtitles on the Moving Image: an Overview of Eye Tracking Studies – Jan Louis Kruger, Agnieszka Szarkowska and Izabela Krejtz


This article provides an overview of eye tracking studies on subtitling (also known as captioning), and makes recommendations for future cognitive research in the field of audiovisual translation (AVT). We find that most studies in the field that have been conducted to date fail to address the actual processing of verbal information contained in subtitles, and rather focus on the impact of subtitles on viewing behaviour. We also show how eye tracking can be utilised to measure not only the reading of subtitles, but also the impact of stylistic elements such as language usage and technical issues such as the presence of subtitles during shot changes on the cognitive processing of the audiovisual text as a whole. We support our overview with empirical evidence from various eye tracking studies conducted on a number of languages, language combinations, viewing contexts as well as different types of viewers/readers, such as hearing, hard of hearing and Deaf people.


The reading of printed text has received substantial attention from scholars since the 1970s (for an overview of the first two decades see Rayner 1998). Many of these studies, conducted from a psycholinguistic angle, made use of eye tracking. As a result, a large body of knowledge exists on the eye movements made during reading by people with varying levels of reading skill and language proficiency, across a range of ages, first languages, cultural backgrounds and contexts. Studies on subtitle reading, however, have not achieved the same level of scientific rigour, largely for practical reasons: subtitles are not static for more than a few seconds at a time; they compete for visual attention with a moving image; and they compete for overall cognitive resources with verbal and non-verbal sounds. This article will identify some of the gaps in current research in the field, and also illustrate how some of these gaps can be bridged.

Studying the reading of subtitles is significantly different from studying the reading of static text. In the first place, as far as eye tracking software is concerned, subtitles appear on a moving image as image rather than text, which renders traditional text-based reading statistics and software all but useless. This also makes the collection of data for reading research on subtitles a painstakingly slow process involving substantial manual inspection and coding. Secondly, the fact that subtitles appear against the background of the moving image means that they are always in competition with that image, which renders the reading process fundamentally different from that of static texts: on the one hand, the reading of subtitles competes with the processing of the image, sometimes resulting in interrupted reading; on the other, the limited time subtitles remain on screen means that readers have less time to reread or regress to study difficult words or to check information. Either way, studying this reading process, and the cognitive processing that takes place during reading, is much more complicated than in the case of static texts, where we know that the reader is mainly focussing on the words before her/him without additional auditory and visual information to process.

While the viewing of subtitles has been the object of a growing number of eye tracking studies in recent years (see, for example, Bisson et al. 2012; d’Ydewalle and Gielen 1992; d’Ydewalle and De Bruycker 2007; Ghia 2012; Krejtz et al. 2013; Kruger 2013; Kruger et al. 2013; Kruger and Steyn 2014; Perego et al. 2010; Rajendran et al. 2013; Specker 2008; Szarkowska et al. 2011; Winke et al. 2013), the study of the reading of subtitles remains largely uncharted territory with many research avenues still to be explored. Those studies that do venture to measure more than just attention to the subtitle area seldom do so for extended texts.

In this article we provide an overview of studies on how subtitles change the way viewers process audiovisual material, and of studies on the unique characteristics of the subtitle reading process. Taking an analysis of the differences between reading printed (static) text and reading subtitles as a point of departure, we examine a number of aspects typical of the way subtitle text is processed in reading. We also look at the impact of the dynamic nature of the text, and of the competition with other sources of information, on the reading process (including scene perception, changes in the viewing process, shifts between subtitles and image, the visual saliency of text, faces and movement, and cognitive load). We then discuss studies on the impact of graphic elements on subtitle reading (e.g. number of lines, and text chunking), and studies that attempt to measure the subtitle reading process in more detail.

We start off with a discussion of the way in which watching an audiovisual text with subtitles alters viewing behaviour, as well as of the complexities of studying subtitles given the dynamic image that serves as their backdrop. Here we focus on the fleeting nature of the subtitle text, the competition between reading the subtitles and scanning the image, and the interaction between different sources of information. We then discuss internal factors that impact on subtitle processing, such as the language and culture of the audience, the language of the subtitles and the degree of access the audience has to sound, before turning to external factors related to the nature of the audiovisual text and the presentation of the subtitles. Finally, we provide an overview of studies attempting to measure the processing of subtitles, as well as findings from two studies that approach the processing of subtitles in more detail.

The dynamic nature of the subtitle reading process

Reading subtitles differs substantially from reading printed text in a number of respects. As opposed to “static text on a stable background”, the viewer of subtitled audiovisual material is confronted with “fleeting text on a dynamic background” (Kruger and Steyn 2014, 105). In consequence, viewers not only need to process and integrate information from different communication channels (verbal visual, non-verbal visual, verbal auditory, non-verbal auditory; see Gottlieb 1998), but they also have no control over the presentation speed (see Kruger and Steyn 2014; Szarkowska et al. forthcoming). Unlike in the reading of static texts, the pace of reading is therefore in part dictated by the text rather than the reader – by how long the text remains available to be read – and there is much less time for the reader to regress to an earlier part of a sentence or phrase, and no opportunity to return to previous sentences. Reading takes place within a limited window which, the reader is acutely aware, will disappear in a few seconds. Even though there are exceptions to the level of control a viewer has – for example in the case of DVD, PVR and other electronic media, where the viewer can rewind and fast-forward at will – the typical viewing of subtitles for most audiovisual products happens continuously and without pauses, just as when watching live television.

Regressions, which form an important consideration in the reading of static text, take on a different aspect given the viewer’s knowledge that dwelling too long on any part of a subtitle may make it difficult to finish reading it before it disappears. Any subtitle is on screen for between one and six seconds, and the viewer also has to simultaneously process all the other auditory (in the case of hearing audiences) and visual cues. In other words, unlike when reading printed text, reading becomes only one of the cognitive processes the viewer has to juggle in order to understand the audiovisual text as a whole. Some regressions are in fact triggered by the change of image at shot changes (and, to a much lesser extent, scene changes) when the text stays on across these boundaries: the viewer sometimes returns to the beginning of the subtitle to check whether it is a new subtitle, and sometimes even re-reads it. For example, in a recent study, Krejtz et al. (2013) established that participants tend not to re-read subtitles after a shot change or cut, but their data also revealed that a proportion of participants did return their gaze to the beginning of the subtitle after such a change (see also De Linde and Kay, 1999). What this means for the study of subtitle reading is that these momentary returns (even if only for checking) produce a class of regressions that is not in fact a regression to re-read a word or section, but rather a false initiation of reading for what some viewers initially perceive to be a new sentence.

On the positive side, the fact that subtitles are embedded on a moving image and are accompanied by a soundtrack (in the case of hearing audiences) facilitates the processing of language in context. Unfortunately, this context also introduces competition for attention and cognitive resources. For the Deaf and hard of hearing audience, attention has to be divided between reading the subtitles and processing the scene: extracting information from facial expressions, lip movements and gestures, and matching or checking this against the information obtained in the subtitles. For the hearing audience who make use of subtitles for support, or for access to foreign language dialogue, attention is likewise divided between subtitles and the visual scene. And just as Deaf and hard of hearing audiences face the added demand of matching what they read with what they get from non-verbal signs and lip movements, the hearing audience matches what they read with what they hear, checking for correspondence of information and interpreting intonation, tenor and other non-verbal elements of speech.

What stands beyond doubt is that the appearance of subtitles changes the viewing process. In 2000, Jensema et al. famously stated that “the addition of captions to a video resulted in major changes in eye movement patterns, with the viewing process becoming primarily a reading process” (2000a, 275). Having examined the eye movements of six subjects watching video clips with and without subtitles, they found that the onset of a subtitle triggers a change in the eye movement pattern: when a subtitle appears, viewers move their gaze from whatever they were watching in order to follow the subtitle. In a larger-scale study, d’Ydewalle and De Bruycker (2007, 196) concluded that “paying attention to the subtitle at its presentation onset is more or less obligatory and is unaffected by major contextual factors such as the availability of the soundtrack, knowledge of the foreign language in the soundtrack, and important episodic characteristics of actions in the movie: Switching attention from the visual image to “reading” the subtitles happens effortlessly and almost automatically”.

Subtitles therefore appear to cause an eye movement bias similar to that produced by faces (see Hershler & Hochstein, 2005; Langton, Law, Burton, & Schweinberger, 2008; Yarbus, 1967), the centre of the screen, contrast and movement. In other words, subtitles attract the gaze not only because the text is identified as a source of meaningful information (a top-down impulse, as the viewer consciously consults the subtitles to obtain relevant information), but also because of the change to the scene that the appearance of a subtitle causes (a bottom-up impulse, automatically drawing the eyes to what has changed on the screen).

As in most other contexts, the degree to which viewers will process the subtitles (i.e. read them rather than merely look at them when they appear and then look away) will be determined by the extent to which they need the subtitles to follow the dialogue or to obtain information on relevant sounds. In studying visual attention to subtitles it therefore remains a priority to measure the degree of processing, something that has not been done in more than a handful of studies, and something to which we will return later in the article.

Viewers usually attend to the image on the screen, but when subtitles appear, it only takes a few frames for most viewers to move their gaze to read the subtitles. The fact that people tend to move their gaze to subtitles the moment they appear on the screen is illustrated in Figures 1 and 2.

Figure 1. Heat maps of three consecutive film stills – Polish news programme Fakty (TVN) with intralingual subtitles.


Figure 2. Heat maps of two consecutive film stills – Polish news programme Wiadomości (TVP1) with intralingual subtitles.


Likewise, when the gaze of a group of viewers watching an audiovisual text without subtitles is compared to that of a similar group watching the same text with subtitles, the split in attention is immediately visible as the second group reads the subtitles and attends less to the image, as can be seen in Figure 3.

Figure 3. Heat maps of the same scene seen without subtitles and with subtitles – recording of an academic lecture.


Viewer-internal factors that impact on subtitle processing

The degree to which subtitles are processed is far from straightforward. In a study performed at a South African university, Sesotho-speaking students watching a recorded lecture with subtitles in their first language and audio in English (their language of instruction) were found to avoid looking at the subtitles (see Kruger, Hefer and Matthew, 2013b). Sesotho students in a different group, who saw the same lecture with English subtitles, processed the subtitles to a much larger extent. This contrast is illustrated in the focus maps in Figure 4.


Figure 4. Focus maps of Sesotho students looking at a lecture with intralingual English subtitles (left) and another group looking at the same lecture with interlingual Sesotho subtitles (right) – recording of an academic lecture.

The difference in eye movement behaviour between the conditions is also evident when considering the number of subtitles skipped. Participants in the above study who saw the video with Sesotho subtitles skipped an average of around 50% of the Sesotho subtitles (median at around 58%), whereas participants who saw the video with English subtitles only skipped an average of around 20% of the English subtitles (with a median of around 8%) (see Kruger, Hefer & Matthew, 2014).

This example does not, however, represent the conventional use of subtitles, where viewers rely on the subtitles to gain access to a text from which they would otherwise have been excluded. It does serve to illustrate that subtitle reading is not unproblematic, and that more research is needed on the nature of processing in different contexts by different audiences. For example, in a study in Poland, interlingual subtitles (English to Polish) were skipped slightly less often by hearing viewers than intralingual subtitles (Polish to Polish), possibly because hearing viewers did not need the latter to follow the plot (see Szarkowska et al., forthcoming).

Another important finding from eye tracking studies on the subtitle process relates to how viewers typically go about reading a subtitle. Jensema et al. (2000) found that in subtitled videos, “there appears to be a general tendency to start by looking at the middle of the screen and then moving the gaze to the beginning of a caption within a fraction of a second. Viewers read the caption and then glance at the video action after they finish reading” (2000, 284). This pattern is indeed often found, as illustrated in the sequence of frames from a short video from our study in Figure 5.

Figure 5. Sequence of typical subtitle reading – a recording of Polish news programme Fakty (TVN) with intralingual subtitles.


Some viewers, however, do not read so smoothly, and tend to shift their gaze between the image and the subtitles, as demonstrated in Figure 6. Gaze shifts between the image and the subtitle, also referred to in the literature as ‘deflections’ (de Linde and Kay 1999) or ‘back-and-forth shifts’ (d’Ydewalle and De Bruycker 2007), can be regarded as an indication of the smoothness of the subtitle reading process: the fewer the gaze shifts, the more fluent the reading, and vice versa.

Figure 6. Scanpath of frequent gaze shifting between text and image – a recording of Polish news programme Fakty (TVN) with intralingual subtitles.


An important factor that influences subtitle reading patterns is the nature of the audience. In Figure 7 an interesting difference is shown between the way a Deaf and a hard of hearing viewer watched a subtitled video. The Deaf viewer moved her gaze from the centre of the screen to read the subtitle and then, after having read the subtitle, returned the gaze to the centre of the screen. In contrast, the hard of hearing viewer made constant comparisons between the subtitles and the image, possibly relying on residual hearing and trying to support the subtitle reading process with lip-reading. Such a result was reported by Szarkowska et al. (2011), who found differences in the number of gaze shifts between the subtitles and the image in the verbatim subtitles condition, particularly discernible (and statistically significant) in the hard of hearing group (when compared to the hearing and Deaf groups).

Figure 7. Scanpaths of Deaf and hard of hearing viewers. Left: Gaze plot illustrating the viewing pattern of a Deaf participant watching a clip with verbatim subtitles.  Right: Gaze plot illustrating the viewing pattern of a hard of hearing participant watching a clip with verbatim subtitles.


These provisional qualitative indications of differences between eye movements of users with different profiles require more in-depth quantitative investigation and the subsequent section will provide a few steps in this direction.

As mentioned above, subtitle reading patterns largely depend on the type of viewer. Fluent readers have been found to have no difficulty following subtitles. Diao et al. (2007), for example, found a direct correlation between the impact of subtitles on learning and the academic and literacy levels of participants. Similarly, given that “hearing status and literacy tend to covary” (Burnham et al. 2008, 392), some previous studies found important differences in the way hearing and hearing-impaired people watch subtitled programmes. Robson (2004, 21) notes that “regardless of their intelligence, if English is their second language (after sign language), they [i.e. Deaf people] cannot be expected to have the same comprehension levels as hearing people who grew up exposed to English”. This is indeed confirmed by Szarkowska et al. (forthcoming), who report that Deaf and hard of hearing viewers in their study made more fixations on the subtitles, and dwelt on them for longer, than hearing viewers. This result may indicate the greater effort needed to process subtitled content and more difficulty in extracting information (see Holmqvist et al. 2011, 387-388). This, in turn, may stem from the fact that for some Deaf people the language of the subtitles is not their mother tongue (their L1 being sign language). At the same time, for hearing-impaired viewers, subtitles provide an important source of information on the words spoken in the audiovisual text, as well as on other information contained in the audio track, which in itself explains why they would spend more time looking at the subtitles.

Viewer-external factors that impact on subtitle processing

The ‘smoothness’ of the subtitle reading process depends on a number of factors, including the nature of the audiovisual material as well as technical and graphical aspects of the subtitles themselves. At a general level, genre has an impact both on the role of subtitles in the total viewing experience and on the way viewers process them. For example, d’Ydewalle and Van Rensbergen (1989) found that children in Grade 2 paid less attention to subtitles if a film involved a lot of action (see d’Ydewalle & De Bruycker 2007 for a discussion). One reason could simply be that action films tend to have less dialogue in the first place; more significantly, the pace of the visual editing and the use of special effects create a stronger visual element, shifting the balance of content towards the action (visual content) and away from the dialogue (soundtrack, and therefore subtitles). This, however, is an area that has to be investigated empirically. At a more specific level, technical characteristics of an audiovisual text such as film editing have an impact on the processing of subtitles.

1 Film editing

Film editing has a strong influence on the way people read subtitles, even beyond the difference in editing pace as a result of genre (for example, action and experimental films could typically be said to have a higher editing pace than dramas and documentaries). In terms of audience perception, viewers have been found to be unaware of standard film editing techniques (such as continuity editing) and are thus able to perceive film as a continuous whole in spite of numerous cuts – the phenomenon termed “edit blindness” (Smith & Henderson, 2008, 2). With more erratic and fast-paced editing, it stands to reason that the cognitive demands will increase as viewers have to work harder to sustain the illusion of a continuous whole.

When subtitles clash with editing such as cuts (i.e. if subtitles stay on screen over a shot or scene change), conventional wisdom as passed on by generations of subtitling guides (see Díaz Cintas & Remael 2007, ITC Guidance on Standards for Subtitling 1999) suggests that the viewer will assume that the subtitle has changed with the image and as a consequence they will re-read it (see above). However, Krejtz et al. (2013) reported that subtitles displayed over shot changes are more likely to cause perceptual confusion by making viewers shift their gaze between the subtitle and the rest of the image more frequently than subtitles which do not cross film cuts (cf. de Linde and Kay 1999). As such, the cognitive load is bound to increase.

2 Text chunking and line segmentation

Another piece of conventional wisdom, perpetuated in subtitling guidelines and standards, is that poor line segmentation results in less efficient processing (see Díaz Cintas & Remael 2007, Karamitroglou 1998). In other words, subtitles should be chunked, per line and between subtitles, into self-contained semantic units. The line of dialogue “He told me that he would meet me at the red mailbox” should therefore be segmented in one of the following ways:

He told me he would meet me
at the red mailbox.


He told me
he would meet me at the red mailbox.

Neither of the following segmentations would be optimal, because the prepositional phrase ‘at the red mailbox’ and the clause ‘he would meet me’, respectively, are split, which is considered an error:

He told me he would meet me at the
red mailbox

He told me he
would meet me at the red mailbox.
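This chunking rule can be sketched as a toy heuristic (an illustration only, not an actual subtitling tool; the function-word list is an assumption): flag a first line that strands an article, preposition, or subject pronoun at the line break.

```python
# Toy heuristic for line segmentation (illustration only, not a real
# subtitling tool): a first line should not end on a function word that
# belongs with the phrase that follows it.
BREAK_UNFRIENDLY = {
    "a", "an", "the",                                   # articles
    "at", "in", "on", "of", "to", "by", "for", "with",  # prepositions
    "he", "she", "it", "they", "we", "i", "you",        # subject pronouns
}

def poor_line_break(first_line: str) -> bool:
    """Return True if the first subtitle line strands a function word
    at the line break, splitting a semantic unit."""
    last_word = first_line.strip().rstrip(".?!,").split()[-1].lower()
    return last_word in BREAK_UNFRIENDLY
```

On the four example segmentations above, this flags the two ill-segmented first lines (ending in ‘at the’ and ‘he’) and passes the two well-segmented ones (ending in ‘meet me’ and ‘told me’).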

However, Perego et al. (2010) found that poor line segmentation in two-line subtitles did not affect subtitle comprehension negatively. Analysing 28 subtitles viewed by 16 participants, with a threshold line separating the subtitle region from the upper part of the screen (the main film zone), they found no statistically significant difference between well-segmented and ill-segmented subtitles in terms of fixation count, total fixation time, or number of shifts between the subtitle region and the upper area. The only statistically significant difference was in mean fixation duration within the subtitle area, with fixations on ill-segmented subtitles on average 12 ms longer than on well-segmented ones. Although the authors downplay this difference on the grounds of its small size, it does seem to indicate at least a slightly higher cognitive load when subtitles are ill-segmented. The small number of subtitles and participants, however, makes it difficult to generalize from their results; this is again a consequence of the fact that it is difficult to extract reading statistics for subtitles unless reading behaviour can be quantified over longer audiovisual texts.

In a study conducted a few years later, Rajendran et al. (2013) found that “chunking improves the viewing experience by reducing the amount of time spent on reading subtitles” (2013, 5). This study compared conditions different from those of Perego et al. (2010), excluding their ill-segmented condition, and focused mostly on live subtitling with respeaking. In the earlier study, which focused on pre-recorded subtitling, the subtitles in the two conditions were essentially still part of one sense unit that appeared as one two-line subtitle. In the later study, the conditions were chunking by phrase (similar to the well-segmented condition of the earlier study, but with phrases appearing one by one on one line), no segmentation (where the subtitle area was filled with as much text as possible, with no attempt at segmentation), word by word (where words appeared one by one) and chunking by sentence (where sentences appeared one by one). Although this later study therefore investigated essentially different conditions, it did find that the most disruptive condition was the one in which the subtitle appeared word by word, eliciting more gaze points (defined less strictly than in the fixation algorithms used by commercial eye trackers) and more “saccadic crossovers”, or switches between the image and the subtitle area. However, in this study the videos were extremely short (under a minute) and the sound was muted, hampering the ecological validity of the material and once again making the findings less amenable to generalization.

Although both these studies have limitations in terms of generalizability, they both provide some indication that segmentation has an impact on subtitle processing. Future studies will nonetheless have to investigate this aspect over longer videos to determine whether the graphical appearance, and particularly the segmentation of subtitles, has a detrimental effect on subtitle processing in terms of cognitive load and effectiveness.

3 Language

The language of subtitles has received considerable attention from psycholinguists in the context of subtitle reading. D’Ydewalle and de Bruycker (2007) examined eye movement behaviour of people reading standard interlingual subtitles (with the audio track in a foreign language and subtitles in their native language) and reversed subtitles (with the audio in their mother tongue and subtitles in a foreign language). They found more regular reading patterns in the standard interlingual subtitling condition, with the reversed subtitling condition having more subtitles skipped, fewer fixations per subtitle, etc. (see also d’Ydewalle and de Bruycker 2003 and Pavakanun 1993). This is an interesting finding in itself, as it is the reversed subtitling that has been found to be particularly conducive to foreign language learning (see Díaz Cintas and Fernández Cruz 2008, and Vanderplank 1988).

Szarkowska et al. (forthcoming) examined differences in the reading patterns of intralingual (Polish to Polish) and interlingual (English to Polish) subtitles among groups of Deaf, hard of hearing and hearing viewers. They found no differences in reading for the Deaf and hard of hearing audiences, but hearing people made significantly more fixations on subtitles when watching English clips with interlingual Polish subtitles than Polish clips with intralingual Polish subtitles. This confirms that the hearing viewers processed the subtitles to a significantly lower degree when these were redundant, as in the case of intralingual transcriptions of the soundtrack. What would be interesting to investigate in this context are those instances when the hearing audience did in fact read the subtitles, to determine to what extent and under what circumstances redundant written information is used by viewers to support their auditory intake of information.

In a study on the influence of translation strategies on subtitle reading, Ghia (2012) investigated differences in the processing of literal vs. non-literal translations into Italian of an English film clip (6 minutes) watched by Italian EFL learners. According to Ghia, just as subtitle format, layout, and segmentation have the potential to affect visual and perceptual dynamics, the relationship translation establishes with the original text means that “subtitle translation is also likely to influence the perception of the audiovisual product and viewers’ general reading patterns” (2012, 175). Ghia particularly wanted to investigate the processing of different translation strategies in the presence of sound and image alongside the subtitles. She found that non-literal translations (where the target text diverged from the source text) resulted in more deflections between text and image, a finding similar to the disruption Rajendran et al. (2013) observed for word-by-word subtitles.

As can be seen from the above, the aspect of language processing in the context of subtitled audiovisual texts has received some attention, but has not to date been approached in any comprehensive manner. In particular, there is a need for more psycholinguistic studies to determine how subtitle reading differs from the reading of static text, and how this knowledge can be applied to the practice of subtitling.

Measuring subtitle processing

1 Attention distribution and presentation speed

In the study by Jensema et al. (2000), subjects spent on average 84% of the time looking at subtitles, 14% at the video picture and 2% outside the frame. The study represents an important early attempt to identify subtitle reading patterns, but it has considerable limitations: it had only six participants (three deaf and three hearing), and the video clips were extremely short (around 11 seconds each), presented with English subtitles (in upper case) and without sound. The absence of a soundtrack therefore also impacted on the time spent on the subtitles. In Perego et al.’s (2010) study, the ratio is reported as 67% on the subtitle area and 33% on the image; here 41 Italian participants watched a 15-minute clip with a Hungarian soundtrack and Italian subtitles, so the audience again had to rely heavily on the subtitles in order to follow the dialogue. Kruger et al. (2014), in the context of intralingual subtitles in a Psychology lecture in English, found a ratio of 43% on the subtitles, 43% on the speaker and slides, and 14% on the rest of the screen. When the same lecture was subtitled into Sotho, the ratio changed to 20% on the subtitles, 66% on the speaker and slides, and 14% on the rest of the screen. This wide range indicates how the distribution of visual attention varies across contexts with different language combinations, different levels of redundancy of information, and different audiences.

In order to account for “the audiovisual nature of subtitled programmes”, Romero-Fresco (in press) puts forward the notion of ‘viewing speed’ – as opposed to reading speed and subtitling speed – which he defines as “the speed at which a given viewer watches a piece of audiovisual material, which in the case of subtitling includes accessing the subtitle, the accompanying images and the sound, if available”. The perception of subtitled programmes is therefore a result not only of subtitle reading patterns but also of the visual elements of the film. Based on the analysis of over seventy-one thousand subtitles created in the course of the Digital Television for All project, Romero-Fresco provides the following data on viewing speed, reflecting the proportion of time viewers spent looking at subtitles and at images at different subtitle presentation rates (see Table 1).

Viewing speed   Time on subtitles   Time on images
120 wpm         ±40%                ±60%
150 wpm         ±50%                ±50%
180 wpm         ±60–70%             ±30–40%
200 wpm         ±80%                ±20%

Table 1. Viewing speed and distribution of gaze between subtitles and images (Romero-Fresco) 

Jensema et al. also suggested that the subtitle presentation rate may influence the time spent reading subtitles versus watching the rest of the image: “higher captioning speed results in more time spent reading captions on a video segment” (2000, 275). This was later confirmed by Szarkowska et al. (2011), who found that viewers spent more time on verbatim subtitles displayed at higher presentation rates than on edited subtitles displayed at lower rates, as illustrated by Figure 8.

Figure 8. Fixation-count based heatmaps illustrating changes in attention allocation of hearing and Deaf viewers watching videos subtitled at different rates.


2 Mean fixation duration

Irwin (2004, 94) states that “fixation location corresponds to the spatial locus of cognitive processing and that fixation or gaze duration corresponds to the duration of cognitive processing of the material located at fixation”. Within the same activity (e.g. reading), longer mean fixation durations could therefore be said to reflect more cognitive processing and higher cognitive load. One would accordingly expect viewers to have longer fixations when the subject matter is more difficult, or when the language is more specialized. Across activities, however, comparisons of fixation duration are less meaningful, as reading elicits more, and shorter, fixations than scene perception or visual scanning, simply because of the nature of the activities. It is therefore essential in eye tracking studies of subtitle reading to distinguish between the actual subtitles when they are on screen, the rest of the screen, and the subtitle area when there is no text (between successive subtitles).
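This three-way distinction can be operationalised by tagging each fixation with an area of interest, using the subtitle region's screen coordinates and the subtitle timings. A minimal sketch follows; the timings, layout values and field names are hypothetical, not taken from any of the studies discussed here.

```python
from dataclasses import dataclass

# Hypothetical subtitle timings (start, end, in seconds) and a hypothetical
# top edge of the subtitle region (in pixels); real values would come from
# the subtitle file and the stimulus layout.
SUBTITLE_INTERVALS = [(1.0, 3.5), (4.0, 6.2)]
SUBTITLE_REGION_TOP = 600

@dataclass
class Fixation:
    t: float         # onset time (s)
    x: float         # screen position (px)
    y: float
    duration: float  # fixation duration (ms)

def classify(fix: Fixation) -> str:
    """Tag a fixation as falling on subtitle text, on the empty subtitle
    area (no subtitle on screen), or on the rest of the image."""
    in_region = fix.y >= SUBTITLE_REGION_TOP
    subtitle_on_screen = any(s <= fix.t < e for s, e in SUBTITLE_INTERVALS)
    if in_region and subtitle_on_screen:
        return "subtitle"
    if in_region:
        return "empty subtitle area"
    return "image"
```

Mean fixation duration can then be aggregated per tag, rather than over a single crude subtitle-area AOI that also collects data when no subtitle is on screen.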

The difference between reading and scene perception is illustrated in Figure 9: fixations on the image tend to be longer (indicated here by bigger circles) and more exploratory in nature than those on the subtitles, which reflect more focused viewing (see the distinction between focal and ambient fixations in Velichkovsky et al. 2005).

Figure 9. Differences in fixation durations between the image and subtitle text – from Polish TV series Londyńczycy.


Rayner (1984) indicated the impact of different tasks on mean fixation durations, as reflected in Table 2 below:

Task              Mean fixation duration (ms)   Mean saccade size (degrees)
Silent reading    225                           2 (about 8 letters)
Oral reading      275                           1.5 (about 6 letters)
Visual search     275                           3
Scene perception  330                           4
Music reading     375                           1
Typing            400                           1 (about 4 letters)

 Table 2. Approximate Mean Fixation Duration and Saccade Length in Reading, Visual Search, Scene Perception, Music Reading, and Typing[1]

In subtitling, silent reading is accompanied by simultaneous processing of the same information in the soundtrack (in the same or another language) as well as of other sounds and visual signs (for a hearing audience, that is; for a Deaf audience, it would be text and visual signs). The differences in mean fixation duration across these tasks therefore reflect differences in cognitive load. In silent reading of static text, there is no external competition for cognitive resources. When reading out loud, the speaker/reader inevitably monitors his/her own reading, introducing additional cognitive load. As the nature of the sign becomes more abstract, the load, and with it the fixation duration, increases; in the case of typing, different processing, production and checking activities are performed simultaneously, resulting in even higher cognitive load. This is inevitably an oversimplification of cognitive load, and indeed the nature of information acquisition when reading successive groups of letters (words) in a linear fashion differs significantly from that of scanning a visual scene for cues.

Undoubtedly, subtitle reading imposes different cognitive demands, and these demands also depend greatly on the audience. In an extensive study on the differences in subtitle reading between Deaf, hard of hearing and hearing participants, we found a high degree of variation in mean fixation duration between the groups, as well as differences within the Deaf and the hard of hearing groups between subtitles presented at 12 characters per second and at 15 characters per second (see Szarkowska et al. forthcoming).

Group             12 characters per second   15 characters per second
Deaf              241.93 ms                  232.82 ms
Hard of hearing   218.51 ms                  214.78 ms
Hearing           186.66 ms                  186.58 ms

Table 3. Differences in reading subtitles presented at different rates

Statistical analyses performed on the three groups, with mean fixation duration as dependent variable and group and speed as categorical factors, produced a statistically significant main effect, further confirmed by subsequent t-tests that yielded statistically significant differences in mean fixation duration between all three groups at both subtitling speeds. The difference between 12 cps and 15 cps was also significant within the Deaf and hard of hearing groups. This suggests that presentation speed has a more pronounced effect on Deaf and hard of hearing viewers than on hearing ones.
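As a sketch of the kind of pairwise comparison described above, a Welch-type t statistic can be computed on per-participant mean fixation durations. The data below are synthetic, loosely echoing the group means in Table 3, and the original analysis may have used a different t-test variant.

```python
import random
import statistics as st

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances); a p-value would
    additionally require the t distribution, omitted in this sketch."""
    return (st.mean(a) - st.mean(b)) / (
        (st.variance(a) / len(a) + st.variance(b) / len(b)) ** 0.5
    )

random.seed(1)
# Hypothetical per-participant mean fixation durations (ms), loosely echoing
# the group means in Table 3 -- synthetic data, not the study's measurements.
deaf = [random.gauss(242, 20) for _ in range(20)]
hearing = [random.gauss(187, 20) for _ in range(20)]
t_stat = welch_t(deaf, hearing)
```

A large positive t here simply reflects the large difference in group means relative to the within-group variability.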

3 Subtitle reading

As indicated at the outset, one of the biggest hurdles in studying the processing of subtitles is the fact that the subtitles appear as image on image rather than text on image as far as eye tracking analysis software is concerned. Whereas reading statistics software can automatically mark words as areas of interest in static texts and then calculate the number of regressions and refixations, saccade length, and fixation duration and count for specific words, this process has to be done manually for subtitles. The fact that it is virtually impossible to create similar areas of interest on the subtitle words embedded in the image over large numbers of subtitles makes it very difficult to obtain reliable eye tracking results on subtitles as text. This explains the predominance of measures such as fixation count and fixation duration, as well as shifts between subtitle area and image, in eye tracking studies on subtitle processing. As a result, many of these studies do not distinguish directly between looking at the subtitle area and reading the subtitles, and “they tend to define crude areas of interest (AOIs), such as the entire subtitle area, which means that eye movement data are also collected for the subtitle area when there are no subtitles on screen, which further skews the data” (Kruger and Steyn, 2014, 109).

Although a handful of studies come closer to studying subtitle reading by going beyond the study of fixation counts, mean fixation duration, and shifts between subtitle area and image area, most studies tend to focus on amount of attention rather than nature of attention. Briefly, the exceptions can be identified in the following studies: Specker (2008) looks at consecutive fixations; Perego et al. (2010) add the path length (sum of saccade lengths in pixels) to the more conventional measures; Rajendran et al. (2013) add the proportion of gaze points; Ghia (2012) looks at fixations on specific words as well as regressions; Bisson et al. (2012) look at the number of subtitles skipped, and proportion of successive fixations (number of successive fixations divided by total number of fixations); and in one of the most comprehensive studies on the subject of subtitle processing, d’Ydewalle and De Bruycker (2007) look at attention allocation (percentage of skipped subtitles, latency time, and percentage of time spent in the subtitle area), fixations (number, duration, and word-fixation probability), and saccades (saccade amplitude, percentage of regressive eye movements, and number of back-and-forth shifts between visual image and subtitle).

In a recent study, Kruger and Steyn (2014) provide a reading index for dynamic texts (RIDT) designed specifically to measure the degree of reading that takes place when subtitled material is viewed. This index is explained as “a product of the number of unique fixations per standard word in any given subtitle by each individual viewer and the average forward saccade length of the viewer on this subtitle per length of the standard word in the text as a whole” (2014, 110). Taking the location and start time of successive fixations within the subtitle area when a subtitle is present as the point of departure, the number of unique fixations (i.e. excluding refixations, and fixations following a regression) is determined, as well as the average length of forward saccades in the subtitle. This information gives an indication of the meaningful processing of the words in the subtitle when the number of fixations per word, as well as the length of saccades as ratio of the length of the average word in the audiovisual text are calculated. Essentially, the formula quantifies the reading of a particular subtitle by a particular participant by measuring the eye movement during subtitle reading against what is known about eye movements during reading and perceptual span.

In a little more detail, the formula can be written as follows for video v, with participant p viewing subtitle s:


(Kruger and Steyn, 2014, 110).
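The published formula is not reproduced above; from the verbal description, it can be sketched as follows (the notation is an interpretation, not necessarily the authors’ own):

```latex
\mathrm{RIDT}_{pvs} \;=\; \frac{n^{\mathrm{fix}}_{pvs}}{w_s}
\times \frac{\bar{\ell}^{\,\mathrm{sacc}}_{pvs}}{\ell_w}
```

where n^fix_pvs is the number of unique fixations made by participant p on subtitle s of video v, w_s the number of standard words in the subtitle, ℓ̄^sacc_pvs the viewer’s average forward saccade length on the subtitle, and ℓ_w the length of the standard word in the text as a whole.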

The index was validated by comparing its output with a manual inspection of the reading of 145 subtitles by 17 participants, and it makes it possible to study the reading of subtitles over extended texts. In their study, Kruger and Steyn (2014) use the index to determine the relationship between subtitle reading and performance in an academic context, finding a significant positive correlation between the degree to which participants read the subtitles and their performance in a test written after watching subtitled lectures. The RIDT therefore presents a robust index of the degree to which subtitles are processed over extended texts, and could add significant value to psycholinguistic studies on subtitles. Using the index, previous claims that subtitles have a positive or negative impact on comprehension, vocabulary acquisition, language learning or other dependent variables can be correlated with whether, and to what extent, viewers actually read the subtitles.
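As a computational sketch of an index in the spirit of the RIDT (a minimal interpretation of the published description; the original identifies unique fixations and regressions more carefully), the per-subtitle score might be computed as:

```python
def ridt(fix_x, n_standard_words, standard_word_len):
    """Per-subtitle, per-viewer reading score in the spirit of the RIDT:
    (unique fixations per standard word) * (mean forward saccade length
    per standard-word length).

    fix_x: horizontal fixation positions (px), in time order, recorded in
    the subtitle area while the subtitle was on screen."""
    if not fix_x or n_standard_words == 0:
        return 0.0
    # Keep only "unique" fixations: drop any fixation that does not advance
    # beyond the rightmost point read so far (a crude stand-in for excluding
    # refixations and post-regression fixations).
    unique = [fix_x[0]]
    for x in fix_x[1:]:
        if x > unique[-1]:
            unique.append(x)
    forward = [b - a for a, b in zip(unique, unique[1:])]
    mean_saccade = sum(forward) / len(forward) if forward else 0.0
    return (len(unique) / n_standard_words) * (mean_saccade / standard_word_len)
```

For instance, a five-standard-word subtitle (standard word about 50 px wide) read with four progressive fixations 50 px apart scores 0.8, while a skipped subtitle scores 0.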


From this overview of studies investigating the processing of subtitles on the moving image it should be clear that much still needs to be done to gain a better understanding of the impact of various independent variables on subtitle processing. The complexity of the multimodal text, and in particular the competition between different sources of information, means that a subtitled audiovisual text is a substantially altered product from a cognitive perspective. Much progress has been made in coming to grips with the way different viewers behave when looking at subtitled audiovisual texts, but there are still more questions than answers – relating, for instance, to differences in how people process subtitled content on various devices (cf. the HBBTV4ALL project). The use of physiological measures like eye tracking and EEG (see Kruger et al. 2014) in combination with subjective measures like post-report questionnaires is, however, continually bringing us closer to understanding the impact of audiovisual translation like subtitling on the experience and processing of audiovisual texts.



This study was partially supported by research grant No. IP2011 053471 “Subtitling for the deaf and hard of hearing on digital television” from the Polish Ministry of Science and Higher Education for the years 2011–2014.



Bisson, Marie-Josée, Walter Van Heuven, Kathy Conklin, and Richard Tunney. 2014. “Processing of Native and Foreign Language Subtitles in Films: An Eye Tracking Study.” Applied Psycholinguistics 35(2):399-418.

Burnham, Denis, Greg Leigh, William Noble, Caroline Jones, Michael Tyler, Leonid Grebennikov, and Alex Varley. 2008. “Parameters in Television Captioning for Deaf and Hard-of-Hearing Adults: Effects of Caption Rate versus Text Reduction on Comprehension.” Journal of Deaf Studies and Deaf Education 13 (3): 391-404.

de Linde, Zoé and Neil Kay. 1999. The Semiotics of Subtitling. Manchester: St. Jerome.

Diao, Y., P. Chandler, and J. Sweller. 2007. “The Effect of Written Text on Comprehension of Spoken English as a Foreign Language.” The American Journal of Psychology 120 (2): 237-261.

Díaz Cintas, Jorge and Marco Fernández Cruz. 2008. “Using Subtitled Video Materials for Foreign Language Instruction.” In The Didactics of Audiovisual Translation, edited by Jorge Díaz Cintas, 201-214. Amsterdam/Philadelphia: John Benjamins.

Díaz Cintas, Jorge and Aline Remael. 2007. Audiovisual Translation: Subtitling. Manchester: St. Jerome.

d’Ydewalle, Géry and Wim De Bruycker. 2003. Reading native and foreign language television subtitles in children and adults. In The mind’s eyes: Cognitive and applied aspects of eye movement research, edited by J. Hyönä, R. Radach and H. Deubel, 444-461. New York: Springer-Verlag.

d’Ydewalle, Géry and Wim De Bruycker. 2007. “Eye Movements of Children and Adults while Reading Television Subtitles.” European Psychologist 12:196–205.

d’Ydewalle, Géry and Ingrid Gielen. 1992. “Attention Allocation with Overlapping Sound, Image, and Text.” In Eye Movements and Visual Cognition: Scene Perception and Reading, edited by Keith Rayner, 415–427. New York: Springer-Verlag.

d’Ydewalle, Géry, Johan Van Rensbergen, and Joris Pollet. 1987. Reading a message when the same message is available auditorily in another language: The case of subtitling. In Eye Movements: From Physiology to Cognition, edited by J.K. O’Regan and A. Lévy-Schoen, 313-321. Amsterdam: Elsevier Science Publishers B.V. (North-Holland).

Ghia, Elisa. 2012. “The Impact of Translation Strategies on Subtitle Reading.” In Eye Tracking in Audiovisual Translation, edited by Elisa Perego, 155–182. Roma: Aracne Editrice.

Gottlieb, Henrik. 1998. Subtitling. In Routledge Encyclopaedia of Translation Studies, edited by Mona Baker, 244-248. London & New York: Routledge.

Hershler, Orit and Shaul Hochstein. 2005. At first sight: a high-level pop out effect for faces. Vision Research, 45, 1707–1724.

Holmqvist, Kenneth et al. 2011. Eye Tracking: A Comprehensive Guide to Methods and Measures. Oxford: Oxford University Press.

Irwin, David E. 2004. Fixation location and fixation duration as indices of cognitive processing. In J.M. Henderson & F. Ferreira (Eds.), The interface of language, vision, and action: Eye movements and the visual world, 105-133. New York, NY: Psychology Press.

ITC Guidance on Standards for Subtitling. 1999.

Jensema, Carl et al. 2000. Eye movement patterns of captioned TV viewers. American Annals of the Deaf 145 (3): 275-285.

Karamitroglou, Fotios. 1998. A Proposed Set of Subtitling Standards in Europe. Translation Journal 2(2).

Krejtz, Izabela, Agnieszka Szarkowska, and Krzysztof Krejtz. 2013. “The Effects of Shot Changes on Eye Movements in Subtitling.” Journal of Eye Movement Research 6 (5): 1–12.

Kruger, Jan-Louis and Faans Steyn. 2014. “Subtitles and Eye Tracking: Reading and Performance.” Reading Research Quarterly 49 (1): 105–120.

Kruger, Jan-Louis, Esté Hefer, and Gordon Matthew. 2013a. “Measuring the Impact of Subtitles on Cognitive Load: Eye Tracking and Dynamic Audiovisual Texts.” Proceedings of Eye Tracking South Africa 29-31 August 2013, Cape Town.

Kruger, Jan-Louis, Esté Hefer, and Gordon Matthew. 2013b. The impact of subtitles on academic performance at tertiary level. Paper presented at the Linguistics Society of Southern Africa annual conference in Stellenbosch, June, 2013.

Kruger, Jan-Louis. 2013. “Subtitles in the Classroom: Balancing the Benefits of Dual Coding with the Cost of Increased Cognitive Load.” Journal for Language Teaching 47(1):29–53.

Kruger, Jan-Louis, Hefer, Esté, and Gordon Matthew. 2014. Attention distribution and cognitive load in a subtitled academic lecture: L1 vs. L2. Journal of Eye Movement Research 7(5):4, 1–15.

Langton, Stephen R.H., Anna S. Law, Burton, A. Mike and Stefan R. Schweinberger. 2008. Attention capture by faces. Cognition, 107:330-342.

Pavakanun, Ubowanna. 1992. Incidental acquisition of foreign language through subtitled television programs as a function of similarity with native language and as a function of presentation mode. Unpublished doctoral thesis, Leuven, Belgium, University of Leuven.

Perego, Elisa, Fabio Del Missier, Marco Porta and Mauro Mosconi. 2010. “The Cognitive Effectiveness of Subtitle Processing.” Media Psychology 13(3):243–272.

Rajendran, Dhevi, Andrew Duchowski, Pilar Orero, Juan Martínez, and Pablo Romero-Fresco. 2013. “Effects of Text Chunking on Subtitling: A Quantitative and Qualitative Examination.” Perspectives: Studies in Translatology 21(1):5–31.

Rayner, Keith. 1984. Visual selection in reading, picture perception, and visual search: A tutorial review. In Attention and performance edited by H. Bouma and D. Bouhwhuis, vol. 10. Hillsdale, NJ: Erlbaum.

Rayner, Keith 1998. “Eye movements in reading and information processing: Twenty years of research.” Psychological Bulletin, 124:372–422.

Robson, Gary D. 2004. The closed captioning handbook. Amsterdam: Elsevier.

Romero-Fresco, Pablo. In press. The Reception of Subtitles for the Deaf and Hard of Hearing in Europe. Peter Lang.

Smith, Tim, and John M. Henderson. 2008. Edit Blindness: The relationship between attention and global change blindness in dynamic scenes. Journal of Eye Movement Research 2(2), 6:1-17.

Specker, Elizabeth, A. 2008. L1/L2 Eye Movement Reading of Closed Captioning: A Multimodal Analysis of Multimodal Use. Unpublished PhD thesis. University of Arizona.

Szarkowska, Agnieszka, Krejtz, Izabela, and Łukasz Dutka. (forthcoming) The effects of subtitle presentation rate, text editing and type of subtitling on the comprehension and reading patterns of subtitles among deaf, hard of hearing and hearing viewers. To appear in: Across Languages and Cultures 2016, vol. 2.

Szarkowska, Agnieszka, Krejtz, Izabela, Kłyszejko, Zuzanna and Anna Wieczorek. 2011. “Verbatim, standard, or edited? Reading patterns of different captioning styles among deaf, hard of hearing, and hearing viewers”. American Annals of the Deaf 156 (4):363-378.

Vanderplank, Robert. 1988 “The value of teletext sub-titles in language learning”. ELT Journal 42(4):272-81.

Velichkovsky, Boris M., Markus Joos, Jens R. Helmert, and Sebastian Pannasch. 2005. Two Visual Systems and Their Eye Movements: Evidence from Static and Dynamic Scene Perception. In CogSci 2005: Proceedings of the XXVII Conference of the Cognitive Science Society, 2283–2288.

Winke, Paula, Susan Gass, and Tetyana Syderenko. 2013. “Factors Influencing the Use of Captions by Foreign Language Learners: An Eye Tracking Study.” The Modern Language Journal 97 (1):254–275.

Yarbus, Alfred L. 1967. Eye movements and vision. New York, NY: Plenum Press.



[1] Values are taken from a number of sources and vary depending on a number of factors (see Rayner, 1984)



Jan-Louis Kruger is director of translation and interpreting in the Department of Linguistics at Macquarie University in Sydney, Australia.  He holds a PhD in English on the translation of narrative point of view. His main research interests include studies on the reception and cognitive processing of audiovisual translation products including aspects such as cognitive load, comprehension, attention allocation, and psychological immersion.

Agnieszka Szarkowska, PhD, is Assistant Professor in the Institute of Applied Linguistics at the University of Warsaw, Poland. She is the founder and head of the Audiovisual Translation Lab, a research group working on media accessibility. Her main research interests lie in audiovisual translation, especially subtitling for the deaf and the hard of hearing, and audio description.

Izabela Krejtz, PhD, is Assistant Professor at the University of Social Sciences and Humanities, Warsaw. She is a co-founder of the Eyetracking Research Center at USSH. Her research interests include neurocognitive and educational psychology. Her applied work focuses on positive training of attention control, eye tracking studies of the perception of audiovisual material, and emotion regulation.

Sound and Sight: An Exploratory Look at Saving Private Ryan through the Eye Tracking Lens – Jennifer Robinson, Jane Stadler and Andrea Rassell


Using eye tracking as a method to analyse how four subjects respond to the opening Omaha Beach landing scene in Saving Private Ryan (Steven Spielberg, 1998), this article draws on insights from cinema studies about the types of aesthetic techniques that may direct the audience’s attention, along with findings about cognitive resource allocation in the field of media psychology, to examine how viewers’ eyes track across film footage. In particular, this study examines differences when viewing the same film sequences with and without sound. The authors suggest that eye tracking on its own is a technological tool that can be used both to reveal individual differences in experiencing cinema and to find psychophysiologically governed patterns of audience engagement.


Steven Spielberg’s Saving Private Ryan (1998) begins at a geriatric pace, ambling alongside an elderly World War II veteran as he visits a military cemetery and begins to reminisce about the men who saved his life during the Battle of Normandy in June, 1944. This is where the story really starts, with a platoon of terrified, seasick servicemen led by Captain John Miller (Tom Hanks) landing on Omaha Beach where they come under heavy fire by German infantry. The Omaha Beach landing scene is gruelling in its experiential intensity as the hand-held camera locates the audience alongside soldiers desperately fighting their way toward the enemy line amidst relentless machine gunfire and bone-shuddering explosions that tear them limb from limb.

An interdisciplinary 2014 study by Vittorio Gallese (one of the scientists credited with the discovery of mirror neurons), fellow neuroscientists Katrin Heimann and Maria Alessandra Umiltà, and film scholar Michele Guerra investigated the effects of camera movement on the audience’s feeling of involvement in film scenes and their ability to place themselves in the position of a screen character. This study was conducted using a high-density electroencephalogram (EEG) to test whether the audience’s experience of what Gallese (2012, 2013) terms “embodied simulation”—that is, neural mirroring responses that are associated with empathy—is affected by camera movement as well as by the action of human figures on screen. The researchers found that the relationship between cognition and action perception is significantly influenced by camera movement and that the use of camera techniques such as steadicam elicit stronger mirroring responses and an augmented sense of involvement in the scene because this type of cinematography more closely resembles human movement than static camera, zooms, or dolly-mounted tracking shots (Heimann et al. 2014, 2098–99).

These findings are consistent with eye tracking studies by Paul Marchant and colleagues who have demonstrated that the audience’s visual attention is captured and guided by mobile framing, focus, the direction of screen characters’ movement and lines of sight, and the colour and motion of other aspects of the mise-en-scene (Marchant et al. 2009, 157–58). This interplay of figure movement and the technical and aesthetic dimensions of cinematography is relevant to Saving Private Ryan in that the arresting beach landing scene at the start of the film is shot almost exclusively using hand-held camera to simulate human movement. The study by Heimann and colleagues suggests that this form of camera movement, teamed with the panicked motion of the figures on screen, functions to elicit a sense of affective identification with Captain Miller and the soldiers he leads by stimulating a shared experience of embodied confusion and sensory overload as the military men shake with fear and scramble to dodge the shrapnel ricocheting across the war-ravaged beach. The unstable gaze of the constantly moving camera makes it as difficult for the audience as it is for the soldiers in the scene to focus attention or see a pathway to safety and this shared perceptual experience may elevate neural mirroring responses or empathic concordance with observed actions.

Venturing into an area that has received little attention from film scholars, media effects researchers and neuroscientists alike, we were struck by the acoustic ferocity of the Omaha Beach scene and we sought to understand the ways in which sound functions as a perceptual cue that may affect the cinema audience’s attention and modulate gaze patterns. This interdisciplinary study brings empirical eye tracking research into dialogue with formalist understandings of film style and cognitive engagement with narrative, using the following question to establish a framework for analysis: What audio-visual aesthetic cues guide the audience’s attention and what psychophysiological processes underlie audience responses to the screen? In particular, we draw on existing research on film dialogue by Todd Berliner and others and we supplement eye tracking data by drawing on Lisa Coulthard’s concept of “dirty sound,” Vivian Sobchack’s work on the “sonic imagination” and cognitivist methods of aesthetic film analysis to work through the experiential dimensions of the sonic confusion generated in the scene.

Cognitivist film theory, as advanced by scholars such as David Bordwell and Carl Plantinga, conceptualises film and television spectatorship as the active construction of meaning via the inferential elaboration of perceptual cues and formal screen production conventions. In a quest for greater explanatory power and a more holistic understanding of spectatorship that moves beyond rational thought and conscious inferential processes, film theorists are increasingly drawing upon empirical research in fields such as neuroscience, psychology and media effects to test assumptions about how audiences perceive and respond to screen texts, and to account for the sensory experiences and involuntary physiological reactions of the audience.

Psychophysiological Approaches to Cinema Studies

There are several different empirical approaches to studying audience members’ responses to film, including biometrics, neuroimaging, and psychophysiological techniques. Psychophysiology is an area of research that quantifies physiological or bodily responses to psychological states. Neurocinema (Hasson et al. 2008) and Psychocinematics (Shimamura 2013) are emerging fields that connect these psychophysiological methods to cinematic experience. Where the neurocinematic approach involves imaging of the brain while watching cinema, in psychophysiology the subject’s physiological state is understood to be representative of psychological responses (for example, skin conductance and heart rate indicate arousal or an emotional reaction). One such response is an involuntary orienting response that automatically assigns cognitive resources to processing stimuli in screen texts.

Annie Lang (Lang 2000; Lang et al. 2000) proposes a model of responding to dynamic screen media that starts from the position that there are limited cognitive resources that any individual can bring to bear when processing mediated content. Features of the screen content can automatically consume some of those cognitive resources, which leaves less capacity for the intentional interpretation of meaning, formulation of hypotheses or speculation about protagonists’ motives (the very processes that cognitive film theory privileges). While this has been well developed for visual attributes, such as hard edits, movement and new features, Lang and colleagues are developing a similar catalogue of attributes for aural content (sound). Using a physiological indicator of an orienting response (a short, rapid decrease in heart rate just after the feature is introduced), they have identified “voice changes, music onsets, sound effect onsets, production effect onsets, emotional word onsets, silence onsets, and voice onsets” as aural cues that orient attention (Lang et al. 2014, 4).

Embodied responses to film are not necessarily indicative of cognitive processing as some responses occur in the autonomic nervous system (such as the startle response to a loud sound or a sudden movement); other processes involve the conscious allocation of cognitive resources. For example, seeing a poisonous reptile on screen can make the audience form hypotheses about impending danger, which can then prime emotional reactions such as anxiety. Increased heart rate during the shower scene in Psycho (Alfred Hitchcock, 1960), or light perspiration on the palms as viewers watch Grace Kelly fossick through the neighbour’s apartment in Rear Window (Alfred Hitchcock, 1954), are widely understood to be biological evidence of changes in psychological states in response to cinema. In such a state of arousal hormones are released, blood pressure rises, and brain wave patterns shift. These biological changes can be recorded using non-invasive techniques and have proven to be stable markers of psychophysiological changes. Some commonly used psychophysiological measures are eye tracking, Galvanic Skin Response (GSR), and pupillometry; however, in this exploratory study, only eye tracking has been used.

Eye Tracking

Eye tracking is a technique that can measure the movements of the eye by gauging the direction of infrared light bounced off the eye surface. The most common technique utilises the eye’s physiology to create different reflections of the light source from the pupil and cornea that are captured by two cameras and used to track the gaze and control for head and eye movements. While there are several types of eye tracking devices, those most pertinent to this study include eye trackers that require the viewer to be in a fixed position such as seated in front of a monitor, and those that can be head-mounted or worn like glasses by a mobile viewer. Eye tracking devices are used in a wide variety of fields including marketing, sports coaching and user experience. Tobii Technology’s range of eye trackers is frequently employed for research measuring attention, as is the case in this study. Two of the main characteristics of eye movement that can be measured by eye tracking devices are saccades and fixations.


In order to collect high-quality visual data about our environment, the eye needs to be constantly redirected. We use movements called saccades in order to do this. Saccades occur at a rate of about 2-3 per second (Tatler 2014) and can be voluntary or reflexive (Duchowski 2007). Their duration ranges from 10-100 ms, rendering the individual effectively blind during this time, but not for long enough to be perceivable: “Visual sensitivity effectively shuts down during a saccade via a process known as saccadic suppression, in order to ensure that the rapid movement of light across the retina is not perceived as motion blur” (Smith 2014, 86).


A fixation is a length of time when the eyes stop large movements (saccades) and stay focused on a small visual range (typically about 5 degrees). Fixations should not be thought of as static, as the name implies, but as “miniature eye movements: tremor, drift and microsaccades” (Duchowski 2007, 46). Their duration is usually in the range of 150-600 ms (Duchowski 2007) and most visual information is processed when the eyes stabilize or fixate on a point on the screen (Smith 2014, 86).
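In practice, the boundary between saccades and fixations is drawn algorithmically from the raw stream of gaze samples. The following is a minimal dispersion-threshold sketch in the spirit of common I-DT classifiers; the thresholds and data are illustrative only, not those of any particular tracker or of this study.

```python
def detect_fixations(samples, max_dispersion=35, min_duration=0.1):
    """Classify a list of (time_s, x_px, y_px) gaze samples into fixations.
    A fixation is a run of samples whose combined x/y spread stays within
    max_dispersion pixels for at least min_duration seconds.
    Returns a list of (start_time, end_time, centre_x, centre_y)."""
    fixations, start = [], 0

    def flush(window):
        # Record the window as a fixation if it lasted long enough.
        if window and window[-1][0] - window[0][0] >= min_duration:
            xs = [x for _, x, _ in window]
            ys = [y for _, _, y in window]
            fixations.append((window[0][0], window[-1][0],
                              sum(xs) / len(xs), sum(ys) / len(ys)))

    for end in range(1, len(samples) + 1):
        window = samples[start:end]
        xs = [x for _, x, _ in window]
        ys = [y for _, _, y in window]
        if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
            flush(samples[start:end - 1])  # dispersion broken: close fixation
            start = end - 1               # new window starts at breaking sample
    flush(samples[start:])                # handle the final window
    return fixations

# Two stable gaze clusters separated by a saccade-like jump:
samples = ([(0.05 * i, 100, 100) for i in range(5)] +
           [(0.25 + 0.05 * i, 400, 400) for i in range(5)])
fixations = detect_fixations(samples)  # detects two fixations
```

Real systems (including Tobii's filters) add smoothing, noise rejection and velocity criteria, but the core dispersion idea is the same.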

Previous findings

A consistent finding from eye tracking research that is relevant to this study of cinema is that when scenes are viewed on a screen or a monitor, the gaze tends to fixate at the centre more than the periphery, even when salient features are not located in the middle of the frame. Because this tendency may be adaptive (for example the centre is a good resting place for fast response to new action that requires attending to), rather than solely visual, Benjamin Tatler (2014) warns against a reductive expectation that these fixations are caused by visual stimuli alone. While this study attends closely to visual stimuli and the aesthetic techniques used by filmmakers to direct attention, we also consider aural stimuli and involuntary biological responses.

Despite the large body of eye tracking research, Antoine Coutrot and colleagues claim that until recently, only two preliminary studies had investigated the influence of sound on eye movements and patterns of attention when watching film or video footage (Coutrot et al. 2012, 2).[i] When studying eye movements in response to the presence and absence of sound in audiovisual stimuli, Coutrot et al. analyse differences in three further eye tracking metrics: dispersion, distance to centre, and Kullback-Leibler divergence. Dispersion refers to the “variability of eye positions between observers” (2012, 4). Distance to centre is a measurement of “the distance between the barycenter of a set of eye positions and the centre of the screen” (Coutrot et al. 2012, 4). Kullback-Leibler divergence “is used to estimate the difference between two probability distributions. This metric can be compared as a weighted correlation measure between two probability density functions… The lower the KL-divergence is, the closer the two distributions are… If soundtrack impacts on eye position locations, we should find a significant difference between the mean inter and intra KL-divergences” (Coutrot et al. 2012, 4). Dispersion provides information about the variability between eye positions, but does not determine the relative position of the two data sets of the eye positions for the two stimulus conditions (sound on/sound off). For the KL-divergence, it is the opposite.
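The KL-divergence measure that Coutrot et al. describe can be sketched for two discretised gaze-position maps as follows. This is a minimal illustration of the general formula, not a reconstruction of their implementation; the example distributions are invented.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence between two discretised probability
    distributions of eye positions (e.g. flattened 2-D histograms of
    fixation locations). The lower the value, the more alike the two
    gaze distributions; identical distributions give zero. eps avoids
    division by zero for empty histogram bins."""
    sp, sq = sum(p), sum(q)
    p = [v / sp for v in p]   # normalise to probability distributions
    q = [v / sq for v in q]
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

# Identical gaze maps give a divergence of ~0; dissimilar maps a larger value.
d_same = kl_divergence([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])
d_diff = kl_divergence([0.2, 0.3, 0.5], [0.6, 0.3, 0.1])
```

Comparing mean inter-condition against intra-condition divergences, as Coutrot et al. do, then tests whether the soundtrack shifts where viewers look.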

Coutrot and colleagues (2012) found that eye movements follow a consistent pattern that is involuntary and that is not affected by screen aesthetics, narrative content, genre, sound or other factors in the first second following an edit. After a brief latent phase, the eye automatically refocuses on the centre of the screen after a cut and takes a second to adjust to the new image. Thereafter, they found that sound does influence gaze patterns in the following ways: dispersion is lower in the sound on condition than the sound off condition; fixation locations are different between the two conditions; sound results in larger saccades than the same footage without sound; and sound elicits longer fixations than sound off (Coutrot et al. 2012, 8).

More recently, Coutrot and Guyader found that “removing the original soundtrack from videos featuring various visual content impacts eye positions increasing the dispersion between the eye positions of different observers and shortening saccade amplitudes” (2014, 2). This study also found that in dialogue scenes, the audience’s attention tends to “follow speech turn taking more closely” (Coutrot and Guyader 2014, 1). A 2014 study by Tim Smith also investigated the cross-modal influences of audio on visual attention and found that “When the visual referent is present on the screen, such as the face of a speaker (that is, a diegetic on-screen external sound source), gaze will be biased towards the sound source, and towards the lips if the audio is difficult to interpret” (Smith 2014, 92). This accords with research in film studies into dialogue and conversation in movies. For instance, Berliner notes that movie dialogue is typically scripted to advance the narrative by directing the audience’s attention to key plot points and protagonists;[ii] furthermore, “characters in Hollywood movies communicate effectively and efficiently through dialogue” and “movie characters tend to speak flawlessly” (Berliner 2010, 191). Similarly, Aline Remael identifies the promotion of narrative continuity and textual cohesion as two of the chief functions of film dialogue (2003, 227; 233). Given these findings from two different fields of research, we pay particular attention to gaze patterns during dialogue exchanges in the analysis of Saving Private Ryan that follows.


Building on previous work by Tim Smith, Antoine Coutrot, Nathalie Guyader and other researchers who have used eye tracking to investigate attentional synchrony[iii] (as illustrated in gaze plots and heat maps that represent the concentration of the audience’s gaze), our methodology examines the distribution of fixations across nine smaller central Areas of Interest (AOIs) during film sequences to explore what is occurring for viewers who are not following the predicted pattern and instead are searching for something else. Using two conditions as stimuli (film with sound on, and film with sound off), we conducted a qualitative comparison between and within the viewing patterns of four subjects. Within the limitations of a qualitative and exploratory study with only four subjects, we drilled down to conduct a fine-grained mapping of attention to determine whether it functions in a predictable way in relation to previous findings about dialogue scenes, sonic cues and attention in relation to camera and figure movement.

For the purposes of this study, a Tobii X-120 eye tracker and Tobii Studio 2.3.2 software (Tobii Technology, Stockholm, Sweden) were used to record seven individual subjects (five females, two males) as they watched film footage. As this was an exploratory study, subjects were recruited from the researchers’ networks, with ethics approval. They were seated and positioned 55-65 cm away from the eye tracker for viewing. All subjects were recorded on the same Tobii X-120, in the same room, with the film footage played on the Tobii computer to standardise start times for all subjects to enable comparison in later analysis. Each subject was successfully calibrated by looking at symbols in different areas of the screen, which ensures the eye tracker gets a reliable measure of gaze location across the whole screen. After the viewing session, each subject’s data was analysed for quality, with three subjects excluded because one condition had segments with lower reliability than desired. Thus, the results are reported on the basis of four subjects with high quality and complete data (three females, one male).

We analysed the areas of the screen where subjects looked while watching discrete sequences of the key beach-landing scene at the beginning of Saving Private Ryan. We investigated how different stylistic techniques employed in the following four consecutive sequences of the scene affect the audience’s gaze patterns:

  1. The “Indistinct Dialogue” sequence is an 11-second clip that was chosen with a view to finding out how the audience’s attention is affected when dialogue is overridden by chaotic background noise, forcing the audience to strain to decipher what is being said. This part of the scene occurs immediately after Captain Miller has located the men under his command. The first shot is an unsteady medium close-up of Miller shouting to Horvath (Tom Sizemore) as bullets splash in the surf around him and ping off the metal structure he is crouching behind. Miller yells, “Sergeant Horvath! (Explosion.) Move your men off the beach! (Water splashes up noisily.) Now!” The next shot shows Horvath’s response in a hand-held medium close-up as he points at his men and hollers, “OK you guys, get on my ass! (Directional hand signals as bullet hits metal and drowns out dialogue.) Follow me!” (Horvath ducks as a mortar shell explodes, screen right.)
  2. The “Wounded Man” sequence that occurs as the men move up the beach is a 30-second segment that is noteworthy because it includes a subjective sequence that solicits audience engagement with Captain Miller’s experience of temporary hearing loss following the concussive impact of a mortar shell nearby. This clip begins with a hand-held long shot of carnage on the beach as Miller moves to the right, dragging Briggs, a wounded soldier he is trying to help. The audience hears artillery fire, crashing, splashing, and shouting as Miller lugs Briggs into the mid-ground, with explosions and debris visible in the foreground. As mortar shells hit, spraying blood and water upwards, Miller hollers for a medic. Following a massive explosion, the sound of gunfire in the background is muted and is replaced with the subdued drone of a low, echoing, wind-like sound that communicates Miller’s subjective experience of shellshock as sand obscures everything and Miller falls to the ground. In slow motion, we cut to a low level close-up of boots in the sand. Miller scrambles up and the hand held camera follows him. Other soldiers pass in front of the camera, occluding the lens and masking the edit. The sounds of the battlefield are replaced by an echoing, low frequency droning noise and the subdued clink of military gear as Miller is momentarily dazed by shellshock. As he gets up and grabs Briggs’s arm, the sound of artillery returns loudly and we see Miller in long-shot framed with a low level camera. He staggers, looks back, and realises that Briggs is dead: his lower abdomen and legs have been blasted away. This sequence ends with a close-up of Miller’s reaction as he looks at Briggs in shock, abandons him, crawls away from the camera, then stands and runs toward the sand dunes, into enemy fire and gun smoke.
  3. The “Sand Dunes” sequence is one unusually long and complex shot that lasts for a full minute; however, in the interests of generating a more granular analysis we divided the shot in two. The first 25 seconds “Sand Dunes: In Command” begins with a match on action as the body of a soldier that was catapulted into the air by a grenade in the previous shot now hits the ground. Quickly, the hand-held camera tilts down from the long-shot of the falling soldier, pans left, and follows Miller forward as he dives behind a ridge of sand for shelter. The camera pushes in to frame Miller in close up as the dialogue begins:

Miller: (Turns left to address the radio operator.) Shore Party! No armour has made it ashore. We got no DD tanks on the beach. Dog One is not open. (Miller rolls to the right so he is framed in an over-the-shoulder shot as he shouts to other soldiers seen in medium-long shot on the dune.) Who is in command here?

Soldier: You are, Sir.

Miller: Sergeant Horvath!

Horvath: Sir!

Miller: Do you know where we are?

Horvath: Right where we’re supposed to be, but nobody else is …

4. “Sand Dunes: Radio” is a 35-second continuation of the shot detailed in sequence three, beginning when the hand-held camera pans left as Miller rolls back toward the radio operator, facing the camera in close up as he grabs the radio operator’s shoulder and hollers in his ear, straining to be heard over the background gunfire.

Soldier: (Distant, off screen, as Miller rolls to the left.) Nobody’s where they’re supposed to be!

Miller: (To radio operator) Shore Party! First wave ineffective. We do not hold the beach. Say again, we do not hold the beach. (Miller turns and rolls back towards the right, away from the camera. The camera zooms toward Horvath, excluding Miller from the frame as he listens to Horvath.)

Horvath: (Indistinct) We’re all mixed up, sir. We got the leftovers from Fox Company, Able Company and George Company. Plus we got some Navy Demo guys and a Beachmaster.

Miller: (The camera follows Horvath as he rolls to the left, toward Miller; we then see Miller in medium close-up as he turns back to radio operator.) Shore party! Shore party! (Realises radio operator is deceased; grabs hold of the radio himself.) Cat-F, Cat-F, C-… (Miller realises the radio is dead.)

In Overhearing Film Dialogue, Sarah Kozloff states that “although what the characters say, exactly how they say it, and how the dialogue is integrated with the rest of the cinematic techniques are crucial to our experience and understanding of every film since the coming of sound, for the most part analysts incorporate the information given by a film’s dialogue and overlook the dialogue as signifier” (2000, 6). By contrast, our analysis focuses closely on what the characters say, and also on what they hear. It may seem counterintuitive to be investigating the significance of sound in an eye tracking study, because sound is something that the eyes are not normally required to process. However, the Omaha Beach dialogue sequences are unusual because the audience has to rely on their eyes to search for contextual cues in order to fill in gaps in understanding due to indistinct vocals. Such cues include the direction of figure movement and eye lines (when Horvath yells, “Get on my ass and follow me!” but his words are obscured by background sound), or facial expressions and body language when words don’t fully make sense due to the inclusion of military terminology or unfamiliar radio communication codes and incomplete communicative exchanges (when Miller shouts “Dog-One” and “Cat-F” into the radio and receives no response). Coutrot and colleagues identify numerous instances in which aural and visual stimuli interact to affect attention (they offer as one example, “the help given by ‘lip reading’ to understanding speech, even more when speech is produced in poor acoustical conditions,” as is the case in Saving Private Ryan); furthermore, they report that “perceivers gazed more at the mouth as auditory masking noise levels increased” (Coutrot et al. 2012, 2).[iv] Our study builds on this work, as detailed below.

With respect to each of the four sequences outlined above, two different analyses were conducted. Given that most of the viewing was within the central area of the screen, we subdivided that area into a three by three grid providing nine smaller areas of interest, as illustrated in Figure 1.[v] The total time fixated, and mean fixation duration, were calculated for each of the nine Areas of Interest (AOI). For the default method on the Tobii eye tracking system that we used, a fixation is identified when the gaze remains steadily focused on the same area of X-Y coordinates on the screen, typically within 35 pixels. This technique of dividing the centre of the screen into nine smaller AOIs allowed greater granularity in determining the primary AOI where most people attended and the instances where individuals diverged and looked at other parts of the screen.
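Assigning a gaze point to one of the nine AOIs amounts to bucketing its coordinates into a 3×3 grid over the central region. A minimal sketch follows; the screen and region dimensions are hypothetical, not those used in the study.

```python
def aoi_index(x, y, region):
    """Map a gaze point (x, y) in screen pixels to one of nine Areas of
    Interest, numbered 1-9 left-to-right, top-to-bottom, inside a central
    region given as (left, top, right, bottom). Returns None for gaze
    points that fall outside the central region."""
    left, top, right, bottom = region
    if not (left <= x < right and top <= y < bottom):
        return None
    col = int((x - left) * 3 / (right - left))   # column 0, 1 or 2
    row = int((y - top) * 3 / (bottom - top))    # row 0, 1 or 2
    return row * 3 + col + 1

# Hypothetical central region on a 1920x1080 screen:
region = (320, 180, 1600, 900)
centre_aoi = aoi_index(960, 540, region)   # dead centre -> AOI 5
```

Total time fixated per AOI is then just the sum of fixation durations whose centres map to each index.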

Figure 1: Nine Areas of Interest (AOI)

Figure 1: Nine Areas of Interest (AOI), Saving Private Ryan: Still from Saving Private Ryan (Steven Spielberg, 1998), data illustrated by the authors

The second step was to analyse whether the participants exhibited attentional synchrony and primarily looked at the same AOIs as one another, or displayed individual variation. For this, we calculated an estimate of attentional distribution. For this exploratory study, a simple ratio of time spent following the guided (dominant) viewing pattern compared to time looking at other parts of the screen was calculated.[vi] The dominant AOIs were determined by which AOIs account for over 50% of fixation time in the sound on condition (control). As some sequences guide the viewer between several of the AOIs, the combination that yielded a majority of viewing time was used rather than simply the AOI with the greatest portion of time. The Distribution ratio was then calculated as follows:

Distribution Ratio = (number of non-dominant AOIs viewed × amount of time in those AOIs) / (number of AOIs in the dominant view × amount of time in the dominant AOIs)
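The calculation above can be sketched as follows. The fixation times in the example are hypothetical, invented only to illustrate the arithmetic; they are not the study's data.

```python
def distribution_ratio(fixation_time, dominant_aois):
    """Distribution ratio as defined in the text: the number of
    non-dominant AOIs that received fixations, multiplied by the total
    time spent in them, divided by the number of dominant AOIs multiplied
    by the time spent in them. fixation_time maps AOI number -> total
    fixation time in seconds; dominant_aois is the set of AOIs accounting
    for over 50% of fixation time in the sound-on (control) condition."""
    dom_time = sum(t for a, t in fixation_time.items() if a in dominant_aois)
    non_dom = {a: t for a, t in fixation_time.items()
               if a not in dominant_aois and t > 0}
    return (len(non_dom) * sum(non_dom.values())) / (len(dominant_aois) * dom_time)

# Hypothetical subject: AOI 5 dominant (6 s), scattered time elsewhere.
times = {5: 6.0, 2: 1.0, 6: 0.5, 9: 0.5}
ratio = distribution_ratio(times, {5})
```

Higher values indicate gaze spread across more non-dominant AOIs, for longer; lower values indicate tighter attentional synchrony on the dominant view.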

Eye Tracking Results and Discussion

The distribution ratio was intended to test whether the findings of Coutrot et al. (2012) applied to what we saw in the sequences from the Omaha Beach landing scene. We hypothesised that a lack of sound would increase divergence away from the dominant AOI, compared to the control condition with sound, which should reinforce “attentional synchrony” (Smith 2013) by guiding the viewer to the most important focal area. However, we expected that because some of the sequences we were examining contained many competing audio and visual contextual cues, we might not see such a clear distinction. Furthermore, we anticipated that our findings in relation to dialogue might diverge from Coutrot and Guyader (2014) because the Omaha Beach scene has atypical dialogue sequences that deviate from conventional turn-taking and are overloaded with noise and movement to create a sense of confusion.

We found that averaged across all sequences, three of the viewing subjects followed the expected pattern of greater divergence with sound off, as indicated by a positive difference score in Table 1. The fourth (Subject 3) had slightly higher gaze distribution when there was sound but did have an increase in the mean number of AOIs with fixations in the “sound off” condition. This would suggest that, on the whole, sound does function to focus attention more tightly. Given the small number of participants in this exploratory study, this result is encouraging.

Table 1: Distribution Ratios Arranged by Subject
*Note: A positive difference indicates that, in line with Coutrot et al. (2012), there was greater distribution of attention across AOIs for the no sound condition. The higher distribution ratio when sound was off could be due to more total time fixated away from the dominant AOI, to fixations in more of the nine AOIs (being more spread out), or to a combination of both.

However, not all sequences of the beach scene elicited the same results. When the distribution data is averaged by sequence, it turns out that sequences 1 (d = -3.4) and 3 (d = -1.8) were strongly in the predicted direction, with less distribution across AOIs when there was sound. For sequence 2 there was little difference (d < 1.0), but it was still in the predicted direction. However, for sequence 4 (d = 1.0), there was greater distribution of the fixations away from the dominant AOI in the sound on condition, indicating greater focus when there was no sound. Sequence 4 (“Sand Dunes: Radio”) does not follow screen conventions for shooting dialogue: it is shot in one long take rather than the customary shot-reverse-shot style, and because many of Captain Miller’s lines contain military jargon and receive no response, the audience’s habituated expectations about turn-taking and shifting attention from speaker to speaker are derailed. Breaking with aesthetic and technical conventions may disrupt cognitive processes of meaning-making when watching film. Another, more physiologically based reason that the “Sand Dunes: Radio” sequence may not conform to the gaze distribution patterns found in other parts of the scene is that it is the continuation of a very long take (together the sand dune sequences constitute a single, minute-long shot that viewers watched unbroken); consequently, viewers’ eyes are not re-focused on the centre of the screen following a cut and have more time to rove and explore the visual field for other meaningful cues. Put another way, without any cuts to generate an orienting response during this sequence there is no automatic allocation of cognitive resources to the story or refocusing of attention back onto a particular portion of the screen (Lang 2000, 2014). It is then a very individual response to novel and emotive (signal) cues within the scene that drives where each subject looks over the duration of this sequence. With so many different types of auditory cues that orient the viewer (Lang 2014), it is not surprising that viewers had fixations in the various AOIs we analysed on the screen.

In Saving Private Ryan, as has been found to be the case in other films such as Sergei Eisenstein’s 1938 historical war epic, Alexander Nevsky (see Smith 2014), the overall viewing patterns reflect the intention of the director in that audience members typically look where they are guided to by audio-visual screen conventions. Yet, an important reminder for further investigation is that not all members of an audience respond in the same way to each scene. This leads us to question what cues other than the lack of a sound track might lead to increased gaze distribution.

Even though the size of the nine AOIs and the number of participants were small, paired-sample t-tests were conducted to see whether there were any statistically significant differences at the level of each of the nine AOIs between the sound and no sound viewing experiences of the participants. No significant differences between the sound on and off conditions were found for fixation duration, total time fixated, visit duration or total time visiting any particular AOI for three of the sequences: “Indistinct Dialogue,” “Wounded Man,” or “Sand Dunes: Radio.” However, for “Sand Dunes: In Command,” subjects spent significantly (p < .05) less time looking at areas 5 and 6 when the sound was off (significantly greater mean fixation duration, total time fixated and total time visiting AOI 5 when the sound was on; only total time fixated on AOI 6 was greater when sound was present). Interestingly, this did not translate to an increase in any particular AOI, so it seems their gazes spread out significantly (dispersed) in the sound off condition.
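With only four subjects, the paired-samples t-test used here reduces to a simple computation on the per-subject difference scores. The sketch below uses invented fixation times purely to illustrate the procedure; they are not the study's measurements.

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired-samples t statistic for two equal-length lists of
    per-subject measures (e.g. total time fixated on an AOI with sound
    on vs. sound off). Returns (t, degrees of freedom). With four
    subjects, df = 3, so |t| must exceed 3.182 for p < .05 (two-tailed)."""
    d = [a - b for a, b in zip(x, y)]           # per-subject differences
    n = len(d)
    t = mean(d) / (stdev(d) / math.sqrt(n))     # mean diff over its SE
    return t, n - 1

# Hypothetical data: total time fixated (s) on one AOI, per subject.
sound_on = [4.2, 3.8, 4.5, 4.0]
sound_off = [2.1, 2.6, 1.9, 2.4]
t, df = paired_t(sound_on, sound_off)
```

The small n is why few comparisons reach significance: with df = 3 the critical t value is large, so only big, consistent per-subject differences can register.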

With a small sample size, it is not surprising that there were few statistical differences, but it was surprising that focusing on the central area as represented by the nine AOIs did not pick up what seemed “obvious” when looking at the aggregated gaze plots. For example, in the “Indistinct Dialogue” sequence (see Figure 2) there is an explosion on the right-hand side of the screen that drew the subjects’ attention equally whether sound was on or off (the on-screen characters ducked as the mortar shell whistled in, so aural and visual cues reinforced each other). Although one subject looked down to the lower right part of the screen, outside our central area of view, when sound was off, the overall pattern was consistent in both conditions.


Figure 2a (above) and Figure 2b (below): Gaze plots of four subjects for “Indistinct Dialogue” (Sequence 1): Still from Saving Private Ryan (Steven Spielberg, 1998), data illustrated by the authors

We explored whether eye tracking could help reveal differences in viewing experience amongst subjects or even offer insight into what was happening beneath the surface of apparent synchrony. An obvious finding is that each individual subject had a different gaze pattern across the four sequences sampled, as can be seen when examining the pattern recorded for Subject 2 in Figure 4 and Figure 6. For the longer sequences, their gaze fixated in more than half of the AOIs, while for the shorter sequences they were often more focused on particular AOIs. However, this pattern could change with sound on and off. For example, when examining the pattern for Subject 1 in “Indistinct Dialogue”, there was a noticeable difference between sound on and off: with sound they viewed only three AOIs, but with sound off they spread out to three new AOIs, ranging across six in total. The greatest shift was away from time fixated in the top left corner of our central area (when the footage was played with sound), contracting to the central third of the screen (without sound).

Sean Redmond and colleagues reported that the presence of sound had an effect on fixation duration (number of fixations) only for the “Wounded Man” sequence (forthcoming 2015). In the “Wounded Man” sequence, there was no overall difference in gaze location with sound on or sound off. However, the gaze fixation pattern for Subject 2 showed a large qualitative difference (see Figure 3). With sound off, Subject 2 only looked at AOIs 6, 8 and 9 (the bottom right part of the central area, which is consistent with Miller’s screen direction and the action of falling to the ground and dragging the wounded soldier, Briggs, in this sequence). However, with sound on, Subject 2 fixated at least briefly in all nine AOIs, with the most time shifting to the centre of the screen where noisy background action takes place and other soldiers rapidly pass in front of the camera.


Figure 3: Areas of Interest for subject 2 in “Wounded Man” (Sequence 2): Still from Saving Private Ryan (Steven Spielberg, 1998), data illustrated by the authors

However, when comparing all of the subjects and how they responded to the “Wounded Man” sequence (see Figure 4), the other three subjects exhibited similar patterns of scanning across the AOIs when sound was on and off. This pattern is what we expected for this sequence, which incorporates a significant subjective sound component when Miller experiences shellshock and is temporarily stunned and deafened. It is possible that subjective sound may help to anchor the viewer’s attention to the character’s experience.[vii]


Figure 4: Total fixation duration in nine AOIs for “Wounded Man” (all subjects): Graphs produced by the authors

A final illustration is the “Sand Dunes: In Command” sequence. Subject 3 was interesting because their viewing patterns for “Wounded Man” (sequence 2) and “Sand Dunes: Radio” (sequence 4) were consistent with those of the other subjects; however, their focus in “Sand Dunes: In Command” (sequence 3) was inconsistent. There were clear AOIs in the sound off condition, but with sound their eyes wandered over more of the central area of the screen. Attention is focused on the radio operator in the sound off condition (as indicated by the red bar in AOI 4, middle left of Figure 5). However, with sound on (indicated by the blue bars), the subject’s attention extends to new AOIs, including corners (top left, bottom left, and bottom right) that were not fixated on when there were no sound cues.


Figure 5: Areas of Interest for subject 4 in “Sand Dunes: In Command” (Sequence 3): Still from Saving Private Ryan (Steven Spielberg, 1998), data illustrated by the authors

A comparison between how the subjects fixated during this segment when the sound was on and off (Figure 6) illustrates a similar pattern of eyes fixating in different AOIs when sound was on and off, except for Subject 1. Given the length of this sequence, the focus on three main AOIs for all subjects in the sound off condition is interesting in its consistency. However, even though the fixation data averages out to no difference between sound on and off, each of our subjects had a different response when sound was on, from Subject 4 fixating in all of the AOIs to Subject 1 simply staying longer on the same AOIs. The variation between sound on and off for this sequence may simply be an artefact of camera and figure motion, where shifts in the location of the protagonists’ faces on the screen can result in fixations in non-dominant AOIs (Mital et al. 2011, 19). However, the fact that these shifts did not occur in both conditions indicates that there is something different about those shifts when sound is on and the viewer can hear the dialogue. This is the only sequence where there was a significant difference in total time spent fixated in AOIs 5 and 6. The much lower time spent on key AOIs when sound was off suggests the subjects were looking to the periphery of the screen and did not look as much at the nine central AOIs.


Figure 6: Total fixation duration for nine AOIs in “Sand Dunes: In Command” (all subjects): Graphs produced by the authors

Relating Eye Tracking Findings to Film Aesthetics

Our qualitative, exploratory analysis of gaze patterns in Saving Private Ryan has used eye tracking to offer an empirical account of cognitive-perceptual processing that includes sound and attends closely to audio-visual cues in the film’s stylistic system. In this way we have sought both to redress the limitations of theoretical approaches to film analysis that privilege inferential cognitive processes and to counterbalance the tendency of empirical studies to neglect the role of screen aesthetics in informing audience responses. In particular, we have built on existing work on eye tracking by taking account of how the aesthetic and experiential deployment of sound might affect perceptual processing, given that sound waves have “palpable force,” which means that sound seems “more materialized, more concrete, and more present to our experience than what we can see” (Sobchack 2005, 7). We have worked from the premise that expectations regarding character and narrative are chief among the ways that screen texts engage audiences in the construction of meaning, yet we have also acknowledged that the process of meaning-making is informed by aesthetic cues and by physiological sense-making, which is in part involuntary. The dangerous and chaotic Omaha Beach scene is what Man-Fung Yip refers to as an “intensified sensory environment” in which, “as a concrete visual and visceral force rather than a mere vehicle for semiotic signification, film violence offers an intensity of raw, immediate sensation that powerfully engages the eye and body of the spectator” (2014, 78). Like Yip, our interest has been sustained by “the complex interplay between the capacity of the human body and the resources of the cinematic medium” (2014, 89).

In her acoustic study of extreme cinema, Lisa Coulthard refers to the use of “deliberate sonic imperfections” (2013, 115) in films in which “visual assaultiveness” is paired with a disturbing soundscape: “Capable of impacting the body in palpable ways, sound is mined in many of these films for its viscerality: as one listens to extremities of acoustic proximity, frequency and volume, one’s own body responds in subconscious ways to those depicted and heard on screen” (2013, 117). These insights about sound are pertinent to Saving Private Ryan in that the Omaha Beach scene is designed to bombard the audience with the relentless onslaught of noise and action that the characters themselves face. In analysing this scene, we began with the hypothesis that “when the intensity of a background sound exceeds a certain threshold, mental activity can be paralyzed” (Augoyard and Torgue 2005, 41 qtd in Coulthard 2013, 118). In other words, we questioned whether the frenzied barrage of sound might cause a form of sensory-cognitive overload that could affect typical patterns of perceptual processing.

The eye tracking results did not reveal a significant pattern for scenes where we predicted this would occur. It would be helpful to explore this in the future with other physiological or neuro-measures that are better at identifying moments of cognitive overload or resource allocation. What does seem to emerge from our exploratory study is that even in films that firmly direct attention, as is characteristic of Spielberg’s directorial style, individual audience members bring their own complexity and experience to the viewing.[viii] Lang and colleagues point out that “complexity should be indexed not by how much of something is in the message but rather by how many human processing resources will be consumed when the message is processed” (Lang et al. 2014, 2). With respect to understanding the specific effects of what individual viewers bring with them to the screen, or teasing out how the audience is affected when watching footage that uses hand-held camera, induces cognitive overload, invokes the acoustic imagination, or uses indistinct dialogue, we conclude that this eye tracking study has raised fruitful questions that may best be answered by an approach that includes multiple measures provided by electroencephalograms, pupillometry or galvanic skin response techniques, as well as eye tracking technologies.

The “Indistinct Dialogue” and “Sand Dunes” sequences have what Sobchack terms a larger number of “‘synch points’ (‘salient’ and ‘prominent’ moments of audio-visual synchrony),” such as lines of dialogue, bullets pinging off metal and mortar shells landing, and these sonic cues “are firmly attached in a logically causal—and conventionally ‘realistic’—relation to the image’s specificity” (2005, 6–7). These synchronised sounds are “not as acousmatically imaginative and generative” as we contend that the subjective sound in the “Wounded Man” sequence is because the sounds appear to be “generated from the physical action” seen on the screen (Sobchack 2005, 7). In the “Indistinct Dialogue” and “Sand Dunes” conversation sequences, our eye tracking experiment did not necessarily reveal greater attentional focus with dialogue. While this counters what Coutrot and colleagues found in their 2012 study and Smith’s finding that sound reinforces visual synchrony (Smith 2013), it is in line with our expectation that the unconventional use of indistinct dialogue and chaotic background sound and imagery would disperse attention. Perhaps dialogue is not something that focuses visual attention, but rather something that focuses engagement. When the dialogue is clear, the viewer is able to look around the screen and absorb other cues about context. Precisely because the linguistic meaning is clear, such expository dialogue does not require as many cognitive resources to process and leaves some free for assessing other audio-visual cues. However, when the dialogue is indistinct, the viewer must then use other cues to work out the importance of the speech; in such cases the audience is essentially in the same position as watching without sound—although they may even be worse off in terms of cognitive resource allocation because there is also a barrage of other sound being processed in concert with the visual stimuli.

Overall, our use of eye tracking in conjunction with aesthetic analysis in our investigation of Saving Private Ryan has supported Coutrot and colleagues’ 2012 findings that dispersion (the degree of variability between observers’ eye positions) was lower with sound than without, so sound generally acted to concentrate perceptual attention. However, unlike Coutrot et al., we teamed eye tracking with qualitative film analysis to explore the effect of aesthetic variation and individual differences on gaze patterns as well as to identify common psychophysiologically governed patterns of attention. In this exploratory study, we found that differences in aesthetic techniques within segments of footage in the same film scene do make a difference to the audience’s gaze patterns and attentional fixation, and we found that within these patterns individual subjects exhibited divergent perceptual processes as well. Although our study is more restricted than comparable work undertaken by Coutrot and others, our attention to screen aesthetics and to variations in subjects’ responses within a single scene affords our method broader explanatory power than a study that excludes outliers and looks for commonalities across a wide range of video styles and genres.



Alexander Nevsky. Directed by Sergei Eisenstein, 1938. Mosfilm, DVD.

Augoyard, Jean-Francois, and Henry Torgue. 2005. Sonic Experience: A Guide to Everyday Sounds. Translated by Andra McCartney and David Paquette. Montreal: McGill Queen’s University Press.

Berliner, Todd. 2010. Hollywood Incoherent: Narration in Seventies Cinema. Austin: University of Texas Press.

Bordwell, David. 2009. “Cognitive Theory.” In Routledge Companion to Philosophy and Film, edited by Paisley Livingston and Carl Plantinga. 356–367. London: Routledge.

Coulthard, Lisa. 2013. “Dirty Sound: Haptic Noise in New Extremism.” In The Oxford Handbook of Sound and Image in Digital Media, edited by Carol Vernallis, Amy Herzog and John Richardson. 115–126. New York: Oxford University Press.

Coutrot, Antoine, Gelu Ionescu, Nathalie Guyader and Bertrand Rivet. 2011. “Audio Tracks do not Influence Eye Movements when Watching Videos.” Paper presented at the 34th European Conference on Visual Perception, Toulouse, France, August 30, 2011.

Coutrot, Antoine, Nathalie Guyader, Gelu Ionescu and Alice Caplier. 2012. “Influence of Soundtrack on Eye Movements During Video Exploration.” Journal of Eye Movement Research 5.5: 1–10.

Coutrot, Antoine and Nathalie Guyader. 2014. “How Saliency, Faces, and Sound Influence Gaze in Dynamic Social Scenes.” Journal of Vision 14.8: 5.

Duchowski, Andrew T. 2007. Eye Tracking Methodology: Theory and Practice. Dordrecht: Springer.

Gallese, Vittorio. 2013. “Mirror Neurons, Embodied Simulation and a Second-person Approach to Mind-reading.” Cortex, in press: 1–3. Accessed August 28, 2014.

Gallese, Vittorio and Michel Guerra. 2012. “Embodying Movies: Embodied Simulation and Film Studies.” Cinema: Journal of Philosophy and the Moving Image 3: 183–210.

Hasson, Uri, Ohad Landesman, Barbara Knappmeyer, Ignacio Vallines, Nava Rubin and David J. Heeger. 2008. “Neurocinematics: The Neuroscience of Film.” Projections 2.1: 1–26.

Heimann, Katrin, Maria Alessandra Umiltà, Michele Guerra and Vittorio Gallese. 2014. “Moving Mirrors: A High-density EEG Study Investigating the Effect of Camera Movements on Motor Cortex Activation during Action Observation.” Journal of Cognitive Neuroscience 26.9: 2087–2101.

Kozloff, Sarah. 2000. Overhearing Film Dialogue. Berkeley: University of California Press.

Land, Michael, Neil Mennie and J. Rusted. 1999. “The Roles of Vision and Eye Movements in the Control of Activities of Daily Living.” Perception 28.11: 1311–1328.

Lang, Annie. 2000. “The Limited Capacity Model of Mediated Message Processing.” Journal of Communication 50.1: 46–70.

Lang, Annie, Shuhua Zhou, Nancy Schwartz, Paul D. Bolls and Robert F. Potter. 2000. “The Effects of Edits on Arousal, Attention, and Memory for Television Messages: When an Edit is an Edit Can an Edit be too Much?” Journal of Broadcasting & Electronic Media 44.1: 94–109.

Lang, Annie, Ya Gao, Robert F. Potter, Seungjo Lee, Byungho Park and Rachel L. Bailey 2014. “Conceptualizing Audio Message Complexity as Available Processing Resources.” Communication Research, published online before print. Accessed September 28, 2014, doi: 10.1177/0093650213490722

Marchant, Paul, David Raybould, Tony Renshaw and Richard Stevens. 2009. “Are you seeing what I’m seeing? An Eye-tracking Evaluation of Dynamic Scenes.” Digital Creativity 20.3: 153–163.

McGurk, Harry and John MacDonald. 1976. “Hearing Lips and Seeing Voices.” Nature 264.5588: 746–8. doi:10.1038/264746a0.

Mital, Parag, Tim J. Smith, Robin Hill and John Henderson. 2011. “Clustering of Gaze During Dynamic Scene Viewing is Predicted by Motion.” Cognitive Computation 3.1: 5–24.

Plantinga, Carl. 2009. Moving Viewers: American Film and the Spectator’s Experience. Berkeley: University of California Press.

Psycho. Directed by Alfred Hitchcock, 1960. Shamley Productions, DVD.

Rear Window. Directed by Alfred Hitchcock, 1954. Paramount, DVD.

Redmond, Sean, Sarah Pink, Jane Stadler, Jenny Robinson, Andrea Rassell and Darrin Verhagen. 2015 (forthcoming). “Seeing, Sensing Sound: Eye Tracking Soundscapes in Saving Private Ryan and Monsters Inc.” In Making Sense of Cinema: Empirical Studies into Film Spectators and Spectatorship, edited by CarrieLynn D. Reinhard and Christopher J. Olson. New York: Bloomsbury.

Remael, Aline. 2003. “Mainstream Narrative Film Dialogue and Subtitling.” The Translator 9.2: 225–247.

Saving Private Ryan. Directed by Steven Spielberg. 1998. Dreamworks/Paramount. DVD.

Shimamura, Arthur, ed. 2013. Psychocinematics: Exploring Cognition at the Movies. New York: Oxford University Press.

Sita, Jodi. 2014. Personal Communication. 19 June 2014. Australian Catholic University: Victoria, Australia.

Smith, Tim J. 2014. “Audiovisual Correspondences in Sergei Eisenstein’s Alexander Nevsky: A Case Study in Viewer Attention.” In Cognitive Media Theory (AFI Film Reader), edited by Paul Taberham and Ted Nannicelli. 85–105. New York: Routledge.

Smith, Tim J. 2013. “Watching You Watch Movies: Using Eye Tracking to Inform Cognitive Film Theory.” In Psychocinematics: Exploring Cognition at the Movies, edited by Arthur P. Shimamura. 165–191. New York: Oxford University Press.

Sobchack, Vivian. 2005. “When the Ear Dreams: Dolby Digital and the Imagination of Sound.” Film Quarterly 58.4: 2–15.

Song, Guanghan, Denis Pellerin and Lionel Granjon. 2011. “Sound Effect on Visual Gaze When Looking at Videos.” In 19th European Signal Processing Conference. 2034–2038. Barcelona: EUSIPCO 2011.

Tatler, Benjamin. 2014. “Eye Movements from Laboratory to Life.” In Current Trends in Eye Tracking Research, edited by Mike Horsley, Matt Eliot, Bruce Allen Knight and Ronan Reilly. 17–35. London: Springer.

Võ, Melissa, Tim J. Smith, Parag Mital and John Henderson. 2012. “Do the Eyes Really Have it? Dynamic Allocation of Attention when Viewing Moving Faces.” Journal of Vision 12.13(3): 1–14.

Yip, Man-Fung. 2014. “In the Realm of the Senses: Sensory Realism, Speed, and Hong Kong Martial Arts Cinema.” Cinema Journal 53.4: 76–97.


List of figures

Figure 1: Nine Areas of Interest (AOI), Saving Private Ryan: Still from Saving Private Ryan (Steven Spielberg, 1998), data illustrated by the authors

Figure 2a: Gaze plots of four subjects for “Indistinct Dialogue” (Sequence 1): Still from Saving Private Ryan (Steven Spielberg, 1998), data illustrated by the authors

Figure 2b: Gaze plots of four subjects for “Indistinct Dialogue” (Sequence 1): Still from Saving Private Ryan (Steven Spielberg, 1998), data illustrated by the authors

Figure 3: Areas of Interest for subject 2 in “Wounded Man” (Sequence 2): Still from Saving Private Ryan (Steven Spielberg, 1998), data illustrated by the authors

Figure 4: Total fixation duration in nine AOIs for “Wounded Man” (all subjects): Graphs produced by the authors

Figure 5: Areas of Interest for subject 4 in “Sand Dunes: In Command” (Sequence 3): Still from Saving Private Ryan (Steven Spielberg, 1998), data illustrated by the authors

Figure 6: Total fixation duration for nine AOIs in “Sand Dunes: In Command” (all subjects): Graphs produced by the authors



[i] The two preliminary sound-based eye tracking studies preceding Coutrot et al.’s 2012 publication are a conference presentation by Coutrot et al. (2011), and a conference paper by Song, Pellerin, and Granjon (2011). However, in 2012 Melissa Võ and colleagues also published a study that investigated the effects on attention to faces in videos when the auditory speech track was removed. This study found that when speech was not present, observers’ gaze allocation changed: they looked more at the scene background and decreased fixations to faces generally and especially decreased concentration on the mouth region (Võ et al. 2012, 12).

[ii] A study of everyday attention indicates that people exhibit visual search behaviours that anticipate, locate, and monitor action, which is evidence of top down influences on visual perception (see Land et al. 1999).

[iii] Tim Smith states that “The degree of attentional synchrony observed for a particular movie frame will vary depending on whether it is from a Hollywood feature film or from unedited real-world footage, the time since a cut and compositional details such as focus or lighting but attentional synchrony will always be greater in moving images than static images” (2014, 90).

[iv] The lip-reading phenomenon is called the “McGurk effect” (see McGurk and MacDonald 1976).

[v] For further discussion of central areas of interest in Saving Private Ryan, see Redmond et al. (2015).

[vi] Established formulae for dispersion and other measures of individual variation in gaze pattern exist (e.g., Coutrot et al. 2012). As an exploratory study, we were limited by both the number of subjects and post hoc data analysis. This distribution estimate was a sufficient way to capture dominant and non-dominant viewing. However, we would recommend that future research develop a better variance measure of asynchronous viewing, such as the Kullback–Leibler divergence formula referred to above.
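The Kullback–Leibler divergence compares two discrete gaze distributions over the same AOIs, yielding zero only when they match. A minimal sketch, with invented fixation proportions rather than the study’s data:

```python
import math

def kl_divergence(p, q):
    """D(P || Q) in bits for two discrete distributions over the same
    AOIs; assumes every AOI with p > 0 also has q > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical fixation proportions over nine AOIs, sound on vs. sound off:
# with sound, gaze concentrates on the centre AOI; without, it spreads out.
p_sound_on = [0.05, 0.05, 0.05, 0.10, 0.50, 0.10, 0.05, 0.05, 0.05]
p_sound_off = [0.10, 0.10, 0.10, 0.10, 0.20, 0.10, 0.10, 0.10, 0.10]
divergence = kl_divergence(p_sound_on, p_sound_off)  # > 0: the viewings differ
```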

[vii] Note that similar results were obtained in a related study of a sequence earlier in the beach-landing scene that depicts Captain Miller’s experience of shellshock (Redmond et al. forthcoming 2015).

[viii] A neuroimaging study comparing responses to film clips ranging from a sequence directed by Alfred Hitchcock to a segment of actuality footage shot in Washington Square Park found that higher levels of aesthetic control generate greater viewer synchrony or inter-subject correlation in the audience’s viewing patterns and brain activity (Hasson et al. 2008, 15).



Dr Jennifer Robinson is Lecturer in Public Relations, School of Media and Communication at RMIT University. She authors industry reports and has published in the Journal of Advertising, BMC Public Health, the Journal of Interactive Marketing and the Journal of Public Relations Research. Her media effects research investigates new media and media audiences using neuro-measures.

Jane Stadler is Associate Professor of Film and Media Studies, School of Communication and Arts at the University of Queensland. She is author of Pulling Focus: Intersubjective Experience, Narrative Film and Ethics, and co-author of Screen Media and Media and Society.

Andrea Rassell is a PhD student and Research Assistant in the School of Media and Communication at RMIT University. She has a professional background in both science and film and researches at the nexus of the two disciplines.

How We Came To Eye Tracking Animation: A Cross-Disciplinary Approach to Researching the Moving Image – Craig Batty, Claire Perkins, & Jodi Sita


In this article, three researchers from a large cross-disciplinary team reflect on their individual experiences of a pilot study in the field of eye tracking and the moving image. The study – now concluded – employed a montage sequence from the Pixar film Up (2009) to determine the impact of narrative cues on gaze behaviour. In the study, the researchers’ interest in narrative was underpinned by a broader concern with the interaction of top-down (cognitive) and bottom-up (salient) factors in directing viewers’ eye movements. This article provides three distinct but interconnected reflections on what the aims, process and results of the pilot study demonstrate about how eye tracking the moving image can expand methods and knowledge across the three disciplines of screenwriting, screen theory and eye tracking. It is in this way both an article about eye tracking, animation and narrative, and also a broader consideration of cross-disciplinary research methodologies.



Over the past 18 months, a team of cross-disciplinary researchers has undertaken a pilot study in eye tracking and the moving image that sought to understand where spectators look when viewing animation.[i] The original study employed eye tracking methods to record the gaze of 12 subjects. It used a Tobii X120 (Tobii Technology, 2005) remote eye tracking device, which allowed viewers to watch the animation sequence on a widescreen PC monitor at 25 frames per second, with sound. The eye tracker pairs the movements of the eye over the screen with the stimuli being viewed by the participant. For each scene viewed, the researchers selected areas of interest; for these areas, all of the gaze data, including the number and duration of each fixation, were collected and analysed.
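The per-AOI aggregation described above can be sketched as follows. The 3×3 grid, frame dimensions and fixation records are hypothetical, standing in for an eye tracker’s exported data:

```python
from collections import defaultdict

def aoi_for(x, y, width=1280, height=720):
    """Map a gaze position to one of nine AOIs in a 3x3 grid
    (numbered 1-9, left to right, top to bottom)."""
    col = min(int(x / (width / 3)), 2)
    row = min(int(y / (height / 3)), 2)
    return row * 3 + col + 1

# Hypothetical fixation records: (x, y) position in pixels, duration in ms.
fixations = [(640, 360, 250), (200, 100, 180), (1100, 650, 320)]

fixation_count = defaultdict(int)
total_duration_ms = defaultdict(int)
for x, y, ms in fixations:
    aoi = aoi_for(x, y)
    fixation_count[aoi] += 1
    total_duration_ms[aoi] += ms
```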

Using a well-known montage sequence from the Pixar film Up (2009), this pilot study focussed on narrative, with the aim of discerning whether story cues were instrumental in directing spectator gaze. Focussing on narrative was useful: as well as being an original line of enquiry in the eye tracking context, it offered a natural connection between each of our disciplines and research experiences. The study did not take into account emotional and physiological responses from its participants as a way of discerning their narrative comprehension. Nevertheless, what we found from our data was that characters (especially their faces), key (narrative) objects and visual/scenic repetition seemed to be core factors in determining where viewers looked.[ii]

In the context of a montage sequence that spans around 60 years of story time, in which the death of the protagonist’s wife sets up the physical and emotional stakes of the rest of the film, it was clear that narrative meaning relating to a character’s journey/arc is important to viewers, more so (in this study) than peripheral action or visual style, for example. With regards to animation specifically, a form ‘particularly equipped to play out narratives that solicit […] emotions because of its capacity to illustrate and enhance interior states, and to express feeling that is beyond the realms of words to properly capture’ (Wells, 2007: 127), the highly controlled nature of the sequence from which the data was drawn suggests that animation fully embraces narrative techniques to control viewer attention.

In this article, three researchers from the team – A, a screenwriter, B, a screen scholar and C, an eye tracking neuroscientist – discuss the approaches they took to conducting this study. Each of us came to the project armed with different expertise, different priorities and a different set of expectations for what we might find out, which we could then take back to our individual disciplines. In this article, then, we purposely use three voices as a way of teasing out our understandings before, during and after the study, with the aim of better understanding the potential for cross-disciplinary research in this area. Although other studies in eye tracking and the moving image have been undertaken and reported on, we suggest that using animation with a strongly directed narrative as a test study provides new information. Furthermore, few other studies to date have brought together traditional and creative practice researchers in this way.

What we present, then, is a series of interconnected discussions that draw together ideas from each researcher’s community of thought and practice, guided by the overriding question: how did this study embrace methodological originality and yield innovative findings that might be important to the disciplines of eye tracking and moving image studies? We present these discussions in the format of individual reflections, as a way of highlighting each researcher’s contributions to the study, and in the hope that others will see the potential of disciplinary knowledge in a study such as this one.

How ‘looking’ features in our disciplines, and what we might expect to ‘see’

Researcher A: ‘Looking’ in screenwriting means two things: seeing and reflecting on. By this I mean that a viewer looks at the screen to see what is happening, whilst at the same time reflecting on what they are looking at on a personal, cultural and/or political level. Some screenwriters focus on theme from the outset: on what they want their work to ‘say’ (see Batty, 2013); some screenwriters focus on plot: on what viewers will see (action) (see Vogler, 2007). What connects these is character. In Aristotelian terms, a character does and therefore is (Aristotle, 1996); for Egri, a character is and therefore does (Egri, 2004). The link here is that what we see on the screen (action) is always performed by a character, meaning that through a process of agency, actions are given meaning, feeding into the controlling theme(s) of the text. In this way, looking at – or seeing – is tied closely to understanding and the feelings that we bring to a text. As Hockley (2007) says, viewers are sutured into the text on an emotional level, connecting them and the text through the psychology of story space.

What we ‘see’, then, is meaning. In other words, we do not just see but we also feel. We look for visual cues that help us to understand the narrative unfolding before our eyes. With sound used to point to particular visual aspects and heighten our emotional states, we invest energy and emotion in the visuality of the screen, in the hope that we will arrive at an understanding. As this study has revealed, examples include symbolic objects in the frame (the adventure book; the savings jar; the picture of Paradise Falls) that have narrative value in screenwriting because of the meaning they possess (Batty and Waldeback, 2008: 52-3). By seeing these objects repeated throughout the montage, we understand what they mean (to the characters and to the story) and glean a sense of how they will re-appear throughout the rest of the film as a way of representing the emotional space of the story.

Landscape is also something we see, though this is always in the context of the story world (see Harper and Rayner, 2010; Stadler, 2010). In other words, where is this place? What happens here? What cannot happen here? Characters belong to a story world, and therefore landscape also helps us to understand the situations in which we find them. This, again, draws us back to action, agency and theme: when we see landscape, we are in fact understanding why the screenwriter chose to put their characters – and us, the audience – there in the first place.

Researcher B: In screen theory, looking is never just looking – never innocent and immediate. The act of looking is the gateway to the experience and knowledge of what is seen on screen, but also of how that encounter reflects the world beyond the screen and our place within it. Looking is overdetermined as gazing, knowing and being, endlessly charged by the coincidence of eye and I and of real and reel. Psychoanalytic theory imagines the screen as mirror and our identity as a spectatorial effect of recognizing ourselves in the characters and situations that unfold upon it, however refracted. Reception studies, conversely, seeks out how real individuals encounter content on screen, and how meaning sparks in that meeting—invented anew with every pair of eyes. Television studies emerges from an understanding of a fundamental schism in looking: where the cinematic apparatus enables a gaze, the televisual counterpart can (traditionally) only produce a broken and distracted glance.

All of these theories begin with the act of looking, and are enabled by it in their metaphors, methods and practices. But in no instance is looking attended to as anatomical vision – the process of the “meat and bones” body and brain rather than the metaphysical consciousness. As a scholar of screen theory, my base interest in eye tracking comes down to this “problem”. Is it a problem? Should the biology and theory of looking align? What effects and contradictions arise when they are brought together?

Phenomenological screen theory is a key and complex pathway into this debate, as an approach that values embodied experience, but discredits the ocular—seeking to bring the whole body to spectatorship rather than privilege the centred and distant subject of optical visuality (Marks, 2002: xvi). Vivian Sobchack names film ‘an expression of experience by experience … an act of seeing that makes itself seen, an act of hearing that makes itself heard’ (Sobchack, 1992: 3). Eye tracking shows us the act of seeing – the raw fixations and movements with which screen content is taken in. In the study under discussion here it is this data that is of central interest, with our key questions deriving from what such material can verify about how narrative shapes gaze behaviour. A central question and challenge for me moving forward in this field, though, is to consider this process without ceding to ocularcentrism: that is, without automatically equating seeing to knowing. This ultimately means being cautious about reading gaze behaviour as ‘proof’ of what viewing subjects are thinking, feeling and understanding. This approach will be supported by the inclusion of further physiological measurements.

Researcher C: Interest in vision and how we see the world is age-old, and it has been commonly held that the eyes are the windows to the mind. Where we look is then of great importance, as learning this offers us opportunities to understand more about where the brain wants to spend its time. Human eyes move independently from our heads, and so they have developed a specialised operating system that both allows them to move around our visual environment and counteracts any movements the head may be making. This has led to a distinct set of eye movements we can study: saccades (the very fast bursts of movement that pivot our eye from focus point to focus point) and fixations (brief moments of relative stillness where our gaze stops to allow the receptors in our eye to collect visual information). In addition, only a tiny area at the back of our eyeball, the fovea on the retina, is sensitive enough to gather high-acuity information; thus the brain must drive the eye around precisely in order to get light to fall onto this tiny area. As such, our eye movements are an integral and essential part of our vision system.
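In practice, the distinction between fixations and saccades is recovered from raw eye tracker samples algorithmically. A common approach is dispersion-based detection (Salvucci and Goldberg's I-DT): a run of samples counts as a fixation while it stays within a small spatial window, and everything between fixations is treated as saccadic movement. The sketch below is illustrative only – the thresholds and the assumption of pixel-coordinate samples are ours, not the study's:

```python
# Minimal dispersion-threshold (I-DT style) fixation detector.
# Assumptions: gaze samples are (x, y) screen coordinates at a fixed
# sampling rate; the thresholds below are illustrative, not taken
# from the study described in this article.

def detect_fixations(samples, max_dispersion=25.0, min_length=5):
    """Group consecutive gaze samples into fixations.

    A window of at least `min_length` samples counts as a fixation
    while its dispersion (x-range + y-range) stays under
    `max_dispersion` (here, pixels); samples between fixations are
    treated as saccadic movement and skipped.
    """
    def dispersion(window):
        xs = [p[0] for p in window]
        ys = [p[1] for p in window]
        return (max(xs) - min(xs)) + (max(ys) - min(ys))

    fixations = []
    i = 0
    while i + min_length <= len(samples):
        window = samples[i:i + min_length]
        if dispersion(window) <= max_dispersion:
            # Grow the window until the gaze moves away.
            j = i + min_length
            while j < len(samples) and dispersion(samples[i:j + 1]) <= max_dispersion:
                j += 1
            window = samples[i:j]
            cx = sum(p[0] for p in window) / len(window)
            cy = sum(p[1] for p in window) / len(window)
            fixations.append((cx, cy, len(window)))  # centroid + duration in samples
            i = j
        else:
            i += 1
    return fixations

# Example: a still gaze, a fast jump (saccade), then another still gaze.
samples = [(100, 100)] * 10 + [(300, 180), (500, 250)] + [(600, 300)] * 10
print(detect_fixations(samples))  # → [(100.0, 100.0, 10), (600.0, 300.0, 10)]
```

Commercial eye tracking software performs this classification automatically; the point of the sketch is simply that the fixations and saccades discussed throughout this article are derived categories, sensitive to the thresholds chosen.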

Eye movement research has seen great advances during the last 50 years, with many early questions examined in the classic work of Buswell (1935) and Yarbus (1967). One question visual scientists and neuroscientists have been, and are still, keen to explore is why we look where we do: what is it about the objects or scene that draws our visual attention? Research over the decades has found that several different aspects are involved, relating to object salience, recognition, movement and contextual value (see Schütz et al., 2011). For animations that are used for learning purposes, Schnotz and Lowe (2008) discussed two major contributing factors that influence the attention-grabbing properties of features that make up this form. One is visuospatial contrast and the second is dynamic contrast: features that are relatively large, brightly coloured or centrally placed are more likely to be fixated on than their less distinctive neighbours, and features that move or change over time draw more attention.

Eye tracking research, which is now easier than ever to conduct, allows us to delve into examining how these and other features influence us, and is a unique way to gain access to the windows of the mind. Directing this focus to learning more about how we watch films, and in particular to animation, is what drove me to want to use eye tracking to better see how people experience these; and to delve into questions such as, what are people drawn to look at, and how might things like the narrative affect the way we direct our gaze?

When looking around a visual world, our view is often full of different objects, and we tend to drive our gaze to them so we can recognize, inspect or use them. Not surprisingly, what we are doing (our task at hand) strongly affects how we direct our gaze: as we perform a task, our salience-based mechanisms seem to go offline and we almost exclusively fixate on task-relevant objects (Hayhoe, 2000; Land et al., 1999). From this, one expectation when considering how viewers watch animation is that aspects relating to the viewer’s understanding of the story, rather than salient features, will be the stronger drive. Faces are another well-known drawcard for visual attention, tending to draw the eye very strongly (Cerf et al., 2009; Crouzet et al., 2010). For animated films we were interested to see if similar effects would be observed.

Finally, another strong and interesting effect that has been discussed is the central viewing bias: people tend to fixate in the centre of a display, and this tendency has been shown to exert a large effect on viewing behaviour (Tatler and Vincent, 2009). As this study was based on moving images on screen, we were keen to compare different scenes and how the narrative affected this tendency.

How we came to the project, and what we thought it might reveal

Researcher A: From a screenwriting perspective, I was excited to think that at last, we might have data that not only privileges the story (i.e., the screenwriter’s input), but that also highlights the minutiae of a scene that the screenwriter is likely to have influenced. This can differ in animation from live action, where a team of story designers and animators actively shape the narrative as the ‘script’ emerges (see Wells, 2010). Nevertheless, if we follow that what we see on screen has been imagined or at least intended by a ‘writer’ of sorts – someone who knows about the composition of screen narratives – then it was rousing to think that this study might provide ‘evidence’ to support long-standing questions (for myself at least) of writing for the screen and authorship. Screenwriters work in layers, building a screenplay from broad aspects such as plot, character and theme, to micro aspects such as scene rhythm, dialogue and visual cues. Being able to ‘prove’ what viewers are looking at, and hoping that this might correlate with a screenwriting perspective of scene composition, was very appealing to me.

I was also interested in what other aspects of the screen viewers might look at, either as glances or as gazes. In some genres of screenwriting, such as comedy, much of the clever work comes around the edges: background characters; ironic landscapes; peripheral visual gags, etc. From a screenwriting perspective, then, it was exciting to think that we might find ways to trace who looks at what, and if indeed the texture of a screenplay is acknowledged by the viewer. The study would be limited and not all aspects could be explored, but as a general method for screen analysis, simply having ideas about what might be revealed led to some very interesting discussions within the team.

Researcher B: All screen theories rest upon a fundamental assumption that different types of content, and different viewing situations, produce different viewing behaviours and effects. Laura Mulvey’s famous theory of the gaze stipulates that classical Hollywood cinema and the traditional exhibition environment (dark cinema, large screen, audience silence) position men as bearers of the look and women as objects of the look, and that avant-garde cinemas avoid this configuration (Mulvey, 1975). New theories of digital cinema speculate upon whether a spectator’s identification with an image is altered when it bears no indexical connection to reality; that is, when the image is a simulated collection of pixels rather than the trace of an event that once took place before a camera (Rodowick, 2007). The phenomenological film theory of Laura Marks suggests that certain kinds of video and multimedia work can engender haptic visuality, where the eyes function like ‘organs of touch’ and the viewer’s body is more obviously involved in the process of seeing than is the case with optical visuality (Marks, 2002: 2-3). It made sense to begin our study into eye tracking by thinking about these different assumptions regarding content and context and formulating methods to analyse them empirically.

For our first project we chose to focus on an assumption regarding spectatorship that is more straightforward and essential than any listed above: namely that viewers can follow a story told only in images. This is an assumption that underpins the ubiquitous presence of the montage sequence in narrative filmmaking, where a large amount of story information is presented in a short, dialogue-free sequence. We hypothesized that by tracking a montage sequence we would be able to ascertain if and how viewers looked at narrative cues, even when these are not the most salient (i.e., large, colourful, moving) features in the scene. The study was in this way designed to start investigating how much film directors and designers can control subjects’ gaze behaviour and top-down (cognitively driven) processes.

The sequence from Up! was chosen in part to act as a ‘control’ against which we could later assess different types of content. The story told in the 4-minute sequence is complex but unambiguous, with its events and emotive power linked by clear relationships of cause and effect. It is in this way a prime example of a classical narrative style of filmmaking, where the emphasis is on communicating story information as transparently as possible (Bordwell, 1985: 160). Our hypothesis was that subjects’ gaze behaviour would be controlled by the tightly directed sequence with its strong narrative cues, and that this study could thereby function as a benchmark against which different types of less story-driven material could be compared later.

Researcher C: A colleague and I set up the Eye Tracking and the Moving Image (ETMI) research group in 2012, following discussions around how evidence was collected to support and investigate current film theory. These conversations grew into a determination to begin a cross-disciplinary research group, initially in Melbourne, to begin working together on these ideas. I had previously been involved in research using eye tracking to study other dynamic stimuli such as decision making processes in sport and the dynamics of signature forgery and detection, and my experience led to a belief that the eye tracker could have enormous potential as a research tool in the analysis and understanding of the moving image. Work on this particular study was inspired by the early aims of a subgroup (of which the other authors are a part), whose members were interested to investigate, in a more objective manner, the effect that narrative cues had on viewer gaze behaviour.

Existing research in our disciplines, and how that influenced our approaches to the study

Researcher A: While there had been research already conducted on eye tracking and the moving image, none of it had focussed on the creational aspects of screen texts: what goes into making a moving image text, before it becomes a finished product to be analysed. Much like screen scholarship, which studies texts in a ‘post event’ way, what was lacking – usefully for us – was input from practitioners themselves. The wider Melbourne-based Eye Tracking and the Moving Image research group within which this study sits has a membership that includes other practitioners, including a sound designer and a filmmaker. Combined, this suggested that our approach might offer something different; that it might ‘do more’ and hopefully speak to the industry as well as other researchers. As a screenwriter, the opportunity to co-research with scholars, scientists and other creative practitioners was therefore not only appealing, but also methodologically important.

As already highlighted, it was both an academic and a practical interest in the intersection of plot, character and theme that underpinned my approach. As Smith (1995) has argued, valuing character in screen studies has not always been possible; moving this forward, valuing character, and in particular the character’s journey, has recently become more salient (see Batty, 2011; Marks, 2009), adding weight to a creative practice approach to screen scholarship. In this way, understanding the viewer’s experience of the screen seemed to lend itself well to some of the core concerns of the screenwriter; or to put it another way, had the ability to test what we ‘know’ about creative practice, and the role of the practitioner. Feeding, then, into wider debates about the place of screenwriting in the academy (see Baker, 2013; Price, 2013; 2010), it was important to value the work of the screenwriter, and in a rigorous – and hopefully innovative – scholarly way.

Researcher B: The majority of research on eye tracking and the moving image to date has been designed and undertaken as an extension to cognitive theories of film comprehension. Deriving from the constructivist school of cognitive psychology, and led by film theorist David Bordwell, this approach argues that viewers do not simply absorb but construct the meaning of a film from the data that is presented on screen. This data does not constitute a complete narrative but a series of cues that viewers process by generating inferences and hypotheses (Elsaesser and Buckland, 2002: 170). Bordwell’s approach explicitly opposes psychoanalytic film theory by attending to perceptual and cognitive aspects of film viewing rather than unconscious processes. Psychologist Tim Smith has mobilized eye tracking in connection with Bordwell’s work to demonstrate how this empirical method can “prove” cognitive theories of comprehension—showing that subjects’ eyes do fixate on those cues in a film’s mise-en-scène that the director has controlled through strategies of staging and movement (Smith, 2011; 2013).

The Up study was designed to follow in the wake of Smith’s work, with a particular interest in examining the premise of Bordwell’s theory – which is that narration is the central process that influences the way spectators understand a narrative film (Elsaesser and Buckland, 2002: 170). With this in mind, we deliberately chose a segment from an animated film, where the tightly directed narrative of the montage sequence is competing with a variety of other stimuli that subjects’ eyes could plausibly be attracted to: salient colourful and visibly designed details in the background and landscape of each shot.

We were also interested in this montage sequence for the highly affecting nature of its mini storyline, which establishes the protagonist Carl’s deep love for his wife Ellie as the motivation for his journey in Up! itself. The sequence carries a great deal of emotive power by contrasting the couple’s happiness in their long marriage with Carl’s ultimate sadness and regret at not being able to fulfill their life-long dream of moving to South America before Ellie falls sick and dies. Would it be possible to ‘see’ this emotional impact in viewers’ gaze behaviour?

How we reacted to the initial data, and what it was telling us

Researcher A: When looking at data for the first time, I certainly saw a correlation between what we know about screenwriting and seeing, and what we could now turn to as evidence. For example, key objects such as the adventure book, the savings jar (see Fig. 1) and the picture of Paradise Falls – all of which recurred throughout the montage sequence – were looked at by viewers intensely, suggesting that narrative meaning was ‘achieved’.

Fig. 1. A heat map showing the collective intensity of viewers’ responses to the savings jar.

As another example, when characters were purposely (from a screenwriting perspective) separated within the frame of the action, viewers oscillated between the two, eventually settling on the one they believed to possess the most narrative meaning (see Fig. 2). This further implied the importance of the character journey and its associated sense of theme, which for screenwriting verifies the careful work that has gone into a screenplay to set up narrative expectations.

Fig. 2. A gaze plot showing the fixations and saccades of one viewer in a scene with the prominent faces of Carl and Ellie.

Researcher B: We chose to analyse the data on Up! by examining how viewer attention fluctuated in focus between Carl and Ellie across the course of the montage sequence. The two are equal agents in the narrative at the beginning, but the montage’s story unfolds through the action and behaviour of each as it continues – that is, each character carries the story at different points. Overwhelmingly, the data supported this narrative pattern by showing that the majority of viewers fixated on the character who, moment by moment, functions as the agent of the story, even when that figure is not the most salient aspect of the image. Aligning with Bordwell’s cognitive theory of comprehension, this data confirms that viewers do rely principally on narrative cues to understand a film. As a top-down process of cognition, narrative exerts control over viewer attention to keep focus on the story rather than let the gaze wander to other bottom-up (salient) details in the mise-en-scène. It is this process that allowed Smith to show that viewers overwhelmingly will not notice glaring continuity errors on screen (Smith, 2005). As in the famous ‘Gorillas in our Midst’ experiment (Simons and Chabris, 1999), viewer attention is focused so closely on employing narrative schema to link events spatially, temporally and causally that salient stimuli on screen appear to be completely missed.

Researcher C: Initially I was quite interested to see the attention paid to faces, and in particular, characters’ eyes and mouths. Being animation, I had been keen to see if similar elements of faces would draw viewers’ eyes in the same ways that we look at human faces, where eyes and mouths are most viewed (Crouzet et al., 2010). Here, even though the characters were not engaging in dialogue, their mouths as well as their eyes were still searched. Looking at eyes has been linked to looking for contextual emotional information (Guastella et al., 2007), and so with this montage sequence being non-verbal, it was not surprising to see much of the focus on characters’ eyes as viewers attempted to read the emotion through them (see Fig. 3).

Fig. 3. Two viewers’ gaze plots depicting the sequence of fixations made between Carl and Ellie.

Other areas I was interested to observe were instances when other well-known features drew strong viewer attention, such as written text and bright (salient) objects. Two particular scenes we examined contained examples of these. In one scene, in which the savings jar sits at the back of a dark bookshelf, viewers were drawn both to the bright candle in the foreground and to the savings jar. The jar was in the dark; however, with narrative cues drawing attention to it, as well as the fact that it contained text, viewers were drawn to look at it (see Fig. 1). Surprisingly, although other interesting objects in this scene are easily discernible – a colourful wooden bird figure; a guitar; a compass – it was the savings jar and the bright candle that were viewed. The contextual information, the text and the salience appear to be working together here to drive the eye, all within a few seconds.

Fig. 4. Gaze plots of fixations made by all viewers over the scene in which Carl purchases airline tickets.

The second scene to see text working as a cue for the eye was the travel shop scene (Fig. 4). Here, viewers were drawn to look at two text-based posters placed on the back wall of the shop. Again, this scene was only shown momentarily, yet glances towards the text and images, as well as the exchange between the characters, give viewers the story elements they need to know what is going on, and where the story will go next (Carl’s surprise for Ellie).

How over time we better understood the data, and what we came to know

Researcher A: I was interested to see that some viewers spent time looking at the periphery. The Up! montage sequence did not necessarily offer ‘alternative’ layers in the margins of the screen, though given its created and controlled animated nature, it perhaps should not be a surprise that away from the centre of the screen there were visual delights, such as the sun setting over the city and a blanket of clouds that changed shape, from clouds to animals to babies. This suggested to me that in animation, because viewers know that images have been created from scratch, there is an expectation that the screen will offer a plethora of experiences, from narrative agency to visual amplification. This, in turn, suggested that in further studies, it might be useful to contrast texts that use the potential of the full screen to engage viewers with those that go in close and privilege the centre. Genre would most likely play a key role in this future endeavour.

Researcher B: As hoped, this pilot study has been instructive as a base from which we can now expand. It has raised many questions. One issue is that this data cannot ‘prove’ subjects were not seeing those elements on-screen that were not fixated upon – were they perhaps seeing them peripherally? This could only be confirmed by conducting interviews after the eye tracking takes place, and could instructively inform an understanding of how story information that is layered in the mise-en-scène (for instance in setting, lighting and costume) contributes to overall narrative comprehension. We are also very interested to determine how the context of viewing affects gaze behaviour. For instance, would subjects still fixate overwhelmingly on narrative cues when watching this sequence in a cinema environment on a large – even an IMAX – screen? In this environment the image on screen is larger and the texture more palpable. Would viewers here perhaps be more focused on these salient pleasures of the image and engage in a different, less cognitive experience of the film; letting their eyes roam across the grain of the shot in its colours, shapes and surfaces? Would results alter between an animated and live action film? Psychoanalytic film theory tells us that the cinematic apparatus promotes identification with characters and, by extension, the ideologies of the social system from which they are produced (Mulvey, 1975). Eye tracking can potentially intervene in this powerful theory of spectatorship by showing if and how viewers do fixate on the cues that give rise to this interpellation.

Researcher C: After looking at some of the early scene analyses, I was somewhat surprised by how many eye movements could be made in fleetingly fast scenes, and at how many items in these scenes one could fixate on, if only briefly. I had expected viewers to be taking in some of the surrounding items in a scene using their peripheral vision, and to see more of the centralisation bias (Tatler and Vincent, 2009). Yet for some scenes, in particular for the two scenes in which Carl purchases the surprise airline tickets (see Figs 4 and 5), we see how viewers were drawn to search for narrative clues by looking around the scene.

Fig. 5. Gaze plot showing the fixations made by all viewers as they briefly see the contents of the picnic basket.

In the first scene (see Fig. 4), Carl is seen in a shop, facing the shop assistant. Viewers had previously seen him in the midst of coming up with a bright idea, so this scene gives the viewer a chance to work out what his idea was. What can be seen is that most viewers scanned the surrounds for clues. A similar pattern is seen in the next scene, in which we quickly glimpse the contents of a picnic basket being carried by Carl (see Fig. 5). The basket, seen close up, contains picnic items and the surprise airline ticket, and even though some glances went to other basket items, it was the ticket – the item that held the most narrative information – that captured most of the attention. This item was also the most salient, being the clearest and brightest item in the basket and, importantly, the only item to contain written text. In a very short glimpse of a scene, these features almost ensured that viewers’ eyes were directed to look at and acknowledge the ticket.

What excites us about the future of work in this area, and where we think it might take our own disciplines

Researcher A: If we are to fully embrace the creative practice potential of studies such as this, then we might look to creating new texts that can then be studied. If, in 1971, Norton and Stark created simple drawings to test how their subjects recognised and learned patterns, then over 40 years later, our approach might be to develop a short moving image narrative through which we can test our viewers’ gaze. For example, if we were to develop a short film and play it out of sequence (i.e., narrative meaning altered), might we affect where viewers look? Might they look differently: in different places and for different lengths of time? Similarly, what if we were to musically score a text in different ways, diegetically and non-diegetically? Might we affect the focus of viewer gaze? If so, what might this tell us about narrative attention and filmmaking techniques that sit ‘beyond the screenplay’?

For screenwriting as a discipline, studies such as these would serve two purposes, I feel. Firstly, they would help to strengthen the presence of screenwriting in the academy, especially in regard to innovative research that privileges the role of the practitioner. Accordingly, these studies could provide a variety of methodological approaches that might be of use to other screenwriting scholars; or that might be applied to other creative practice disciplines, in which researchers wish to understand the work that has gone into the creation of a text that might otherwise only be studied once it has been completed. Secondly, and perhaps more importantly, such studies might yield results that benefit, or at least inform, future screenwriting practices. Whether industry-related practices or otherwise, just like all ‘good’ creative practice research, the insights and understandings gained would contribute to the discipline in question in the form of ‘better’ or ‘different’ ways of doing (Harper, 2007). For me, this would reflect both the nature and the value of creative practice research.

Researcher B: All of the potential avenues for future research in this field take an essential interest in how moving images on screen produce a play between top-down and bottom-up cognition. In this, a larger issue for me – going back to the points I raised at the beginning of my section – is how the data can be mobilized beyond a strictly cognitive framework and vocabulary of screen theory. As indicated, the cognitive approach offers a deliberately ‘common sense’ counterpart to a paradigm such as psychoanalysis, with its reliance on myth, desire and fantasy (Elsaesser and Buckland, 2002: 169). Cognitive theory understands a film as a data set that a viewer’s brain processes and completes in an active construction of meaning – an understanding that eye tracking and neurocinematics are very well placed to support and expand. But most screen scholars appreciate and theorize film and television texts as much more than mere sets of data. The moving image is an experience that only ‘works’ by generating emotional affect, by engaging the viewer’s attachments, memories, desires and fears. Film theorist Linda Williams proposes that our investment in following the twists and turns of a narrative is fundamentally reliant upon the emotion of pathos: we continually, pleasurably invest in the expectation that a character will act or be acted upon in such a way that they achieve their goal, and continually, pleasurably have that expectation obscured and dashed by the story (Williams, 1998). So viewer attention is driven not just by a drive to know but also by a desire to feel: to be swept up in waves of hope and disappointment.

The mini storyline of the Up! montage sequence relies entirely on this dialectic of action and pathos. Carl and Ellie’s hopes are repeatedly frustrated, and Carl is finally unable to redeem this pattern before Ellie dies – producing a profound sense of pathos and regret as the defining theme of the sequence. We can see that our subjects’ fixations fell in line with this pattern as the sequence unfolded, consistently focusing on the character who was triggering or carrying the emotional power. But how do we distinguish the ‘felt’ dimension of this gaze from the viewer’s efforts to simply comprehend what is happening by following characters’ movements, facial expressions or body language? How, that is, can we ‘see’ emotional engagement, and start to appreciate how this crucial dimension of spectatorship – based on feeling not thinking – governs the play between top-down and bottom-up cognition in moving pictures? For me, grappling with this problem – and perhaps experimenting with further measurements of pupil dilation, heart rate and brain activity – offers a fascinating pathway into understanding how eye tracking can move beyond an engagement with cognitive film theory to contribute to phenomenological thinking on genuinely embodied seeing and experience.

Researcher C: There is so much that can be done in this area, and that makes it an exciting pursuit; yet what makes it even more motivating is the way that we hope to go about it: collaboratively. One of the core aspects that members of ETMI are very passionate about is working together, bringing in different fields, different disciplines, different ways of seeing things, and building bridges between them. This work is not only about learning more about how we watch and interact with films, but also about having different perspectives on those insights. Work I would personally like to see undertaken in this way is to explore how black and white viewing compares to colourised viewing, and to explore whether and how 3D viewing affects how we gaze about a scene. To compare the gaze and emotional responses of children and adults to the same visual content, and similarly compare visual and emotional responses to material between males and females, and between genre fans and haters, is also an interesting possibility.

Finally, adding to these, I am excited about the potential collection and analysis of other physiological measures to better gauge emotional engagement. These include blood pressure, pupillometry, skin conductance, breathing rate and volume, heart rate, sounds made (gasps, holding breath, sighs etc.) and facial expressions.


By reflecting on each of our research backgrounds, experiences and expectations, what this article has revealed is that while we might have all come to the study with varied approaches and intentions, we have come out of the study with a somewhat surprisingly harmonious set of observations and conclusions. Without knowing it, perhaps, we were all interested in narrative and the role that characters play in driving it. We were also similarly interested in landscape and the visual potential of the screen; not in an obvious way, but in relation to subtext, meaning and emotion. The value of a study like this, then, lies not just in its methodological originality, but also in its ability to stir up passions in cross-disciplinary researchers, whereby each can bring to the table their own skills and ways of understanding data to reach mutual and respective conclusions. Although we ‘knew’ this from undertaking the study, the opportunity to reflect fully on the process in the form of an article has given us an even greater understanding of the collaborative potential of cross-disciplinary researchers such as ourselves.



Aristotle. (1996). Poetics. Trans. Malcolm Heath. London: Penguin.

Baker, Dallas. (2013). Scriptwriting as Creative Writing Research: A Preface. In: Dallas Baker and Debra Beattie (eds.) TEXT: Journal of Writing and Writing Courses, Special Issue 19: Scriptwriting as Creative Writing Research, pp. 1-8.

Batty, Craig, Adrian G. Dyer, Claire Perkins and Jodi Sita. (Forthcoming). Seeing Animated Worlds: Eye Tracking and the Spectator’s Experience of Narrative. In: CarrieLynn D. Reinhard and Christopher J. Olson (eds.). Making Sense of Cinema: Empirical Studies into Film Spectators and Spectatorship. New York: Bloomsbury.

Batty, Craig. (2013) Creative Interventions in Screenwriting: Embracing Theme to Unify and Improve the Collaborative Development Process. In: Shane Strange and Kay Rozynski. (eds.) The Creative Manoeuvres: Making, Saying, Being Papers – the Refereed Proceedings of the 18th Conference of the Australasian Association of Writing Programs, pp. 1-12.

Batty, Craig. (2011). Movies That Move Us: Screenwriting and the Power of the Protagonist’s Journey. Basingstoke: Palgrave Macmillan.

Batty, Craig and Zara Waldeback. (2008). Writing for the Screen: Creative and Critical Approaches. Basingstoke: Palgrave Macmillan.

Bordwell, David. (1985). Narration in the Fiction Film. London: Routledge.

Buswell, Guy T. (1935). How People Look at Pictures. Chicago: Chicago University Press.

Cerf, Moran, E. Paxon Frady and Christof Koch. (2009). Faces and text attract gaze independent of the task: Experimental data and computer model. Journal of Vision, 9(12): 10, pp. 1–15.

Crouzet, Sebastien M., Holle Kirchner and Simon J. Thorpe. (2010). Fast saccades toward faces: Face detection in just 100 ms. Journal of Vision, 10(4): 16, pp. 1–17.

Egri, Lajos. (2004). The Art of Dramatic Writing. New York: Simon & Schuster.

Elsaesser, Thomas and Warren Buckland. (2002). Studying Contemporary American Film: A Guide to Movie Analysis. London: Hodder Headline.

Guastella, Adam J., Philip B. Mitchell and Mark R. Dadds. (2008). Oxytocin increases gaze to the eye region of human faces. Biological Psychiatry, 63, pp. 3-5.

Harper, Graeme and Jonathan Rayner. (2010). Cinema and Landscape. Bristol: Intellect.

Harper, Graeme. (2007). Creative Writing Research Today. Writing in Education, 43, pp. 64-66.

Hayhoe, Mary. (2000). Vision using routines: A functional account of vision. Visual Cognition, 7, pp. 43–64.

Hockley, Luke. (2007). Frames of Mind: A Post-Jungian Look at Cinema, Television and Technology. Bristol: Intellect.

Land, Michael F., Neil Mennie and Jennifer Rusted. (1999). The roles of vision and eye movements in the control of activities of daily living. Perception, 28, pp. 1311–1328.

Marks, Dara. (2009). Inside Story: The Power of the Transformational Arc. London: A&C Black.

Marks, Laura U. (2002). Touch: Sensuous Theory and Multisensory Media. Minneapolis: University of Minnesota Press.

Mulvey, Laura. (1975). Visual Pleasure and Narrative Cinema. Screen, 16(3), pp. 6-18.

Norton, David, and Lawrence Stark. (1971). Scanpaths in eye movements during pattern perception. Science, 171, pp. 308–311.

Price, Steven. (2013). A History of the Screenplay. Basingstoke: Palgrave Macmillan.

Price, Steven. (2010). The Screenplay: Authorship, Theory and Criticism. Basingstoke: Palgrave Macmillan.

Rodowick, David. (2007). The Virtual Life of Film. Cambridge, MA: Harvard University Press.

Schnotz, Wolfgang and Richard K. Lowe. (2008). A unified view of learning from animated and static graphics. In: Richard K. Lowe and Wolfgang Schnotz (eds.). Learning with animation: Research implications for design. New York: Cambridge University Press, pp. 304-356.

Schütz, Alexander C., Doris I. Braun and Karl R. Gegenfurtner. (2011). Eye movements and perception: A selective review. Journal of Vision, 11(5): 9, pp. 1–30.

Simons, Daniel J. and Christopher F. Chabris. (1999). Gorillas in our Midst: Sustained Inattentional Blindness for Dynamic Events. Perception, 28, pp. 1059-1074.

Smith, Murray (1995). Engaging Characters: Fiction, Emotion, and the Cinema. Oxford: Oxford University Press.

Smith, Tim J. (2005). An Attentional Theory of Continuity Editing. [accessed October 17, 2014].

Smith, Tim J. (2011). Watching You Watch There Will Be Blood. [accessed August 22, 2014].

Smith, Tim J. (2013). Watching you watch movies: Using eye tracking to inform cognitive film theory. In: A. P. Shimamura (ed.). Psychocinematics: Exploring Cognition at the Movies. New York: Oxford University Press, pp. 165-191.

Sobchack, Vivian (1992). The Address of the Eye: A Phenomenology of Film Experience. Princeton, N.J: Princeton University Press.

Stadler, Jane (2010). Landscape and Location in Australian Cinema. Metro, 165.

Tatler, Benjamin W., and Benjamin T. Vincent. (2009). The prominence of behavioural biases in eye guidance. Visual Cognition, 17, pp. 1029–1054.

Tobii Technology (2005). User Manual. Tobii Technology AB. Danderyd, Sweden.

Vogler, Christopher (2007). The Writer’s Journey: Mythic Structure for Writers. Studio City, CA: Michael Wiese Productions.

Wells, Paul (2010). Boards, Beats, Binaries and Bricolage – Approaches to the Animation Script. In: Jill Nelmes (ed.) Analysing the Screenplay, Abingdon: Routledge, pp. 104-120.

Wells, Paul (2007) Basics Animation 01: Scriptwriting. Worthing: AVA Publishing.

Williams, Linda (1998). Melodrama Revised. In: Nick Browne (ed.). Refiguring American Film Genres: History and Theory. Berkeley, CA: University of California Press.

Yarbus, Alfred L. (1967). Eye Movements and Vision. New York: Plenum.


List of figures

Fig. 1. A heat map showing the collective intensity of viewers’ responses to the savings jar. Source: author study.

Fig. 2. A gaze plot showing the fixations and saccades of one viewer in a scene with the prominent faces of Carl and Ellie. Source: author study.

Fig. 3. Two viewers’ gaze plots depicting the sequence of fixations made between Carl and Ellie. Source: author study.

Fig. 4. Gaze plots of fixations made by all viewers over the scene in which Carl purchases airline tickets. Source: author study.

Fig. 5. Gaze plot showing the fixations made by all viewers as they briefly see the contents of the picnic basket. Source: author study.



[i] A full analysis of this study, ‘Seeing Animated Worlds: Eye Tracking and the Spectator’s Experience of Narrative’, will appear in the forthcoming collection Making Sense of Cinema: Empirical Studies into Film Spectators and Spectatorship, edited by CarrieLynn D. Reinhard and Christopher J. Olson.

[ii] See Batty, Craig, Dyer, Adrian G., Perkins, Claire and Sita, Jodi (forthcoming) for full results.



Associate Professor Craig Batty is Creative Practice Research Leader in the School of Media and Communication, RMIT University, where he also teaches screenwriting. He is author, co-author and editor of eight books, including Screenwriters and Screenwriting: Putting Practice into Context (2014), The Creative Screenwriter: Exercises to Expand Your Craft (2012) and Movies That Move Us: Screenwriting and the Power of the Protagonist’s Journey (2011). Craig is also a screenwriter and script editor, with experience across short film, feature film, television and online drama.

Dr Claire Perkins is Lecturer in Film and Screen Studies in the School of Media, Film and Journalism at Monash University. She is the author of American Smart Cinema (2012) and co-editor of collections including B is for Bad Cinema: Aesthetics, Politics and Cultural Value (2014) and US Independent Film After 1989: Possible Films (forthcoming, 2015). Her writing has also appeared in journals including Camera Obscura, Critical Studies in Television, Celebrity Studies and The Velvet Light Trap.

Dr Jodi Sita is Senior Lecturer in the School of Allied Health at the Australian Catholic University. She works within the areas of neuroscience and anatomy, with expertise in eye tracking research. She has extensive experience with multiple project types using eye tracking technologies and other biophysical data. As well as her current research into viewer gaze patterns while watching moving images, she is using eye tracking to examine expertise in Australian Rules Football League coaches and players, and to examine the signature forgery process.

Movement, Attention and Movies: the Possibilities and Limitations of Eye Tracking? – Adrian G. Dyer & Sarah Pink


Movies often present a rich encapsulation of the diversity of complex visual information and other sensory qualities and affordances that are part of the worlds we inhabit. Yet we still know little about either the physiological or experiential elements of the ways in which people view movies. In this article we bring together two approaches that have not commonly been employed in audience studies, to suggest ways in which to produce novel insights into viewer attention: the measurement of observer eye movements whilst watching movies, combined with an anthropological approach to understanding vision as a situated practice. We thus discuss both how eye movement studies that investigate complex media such as movies need to consider some of the important principles developed for sensory ethnography, and in turn how ethnographic and social research can gain important insights into aspects of human engagement from emerging technologies that can better map how an understanding of the world is constructed through sensory perceptual input. We consider recent evidence that top-down mediated effects like narrative do promote significant changes in how people attend to different aspects of a film, and thus how film media combined with eye tracking and ethnography may reveal much about how people build understandings of the world.


Seeing in complex environments is not a trivial task. Whilst people are often under the impression that you can believe what you see (Levin et al. 2000), physiological and neural constraints on how our visual system operates mean that only a very small proportion of an overall visual scene can be reliably perceived at any one point in time during the evaluation of a sequence of events. Evidence for the way in which we often only perceive a portion of the vast amount of visual information present in a scene is nicely illustrated in the ‘Gorillas in our Midst’ short (25s) motion sequence, in which six participants (two teams of three, dressed in white and black respectively) are filmed passing a basketball between team members (Simons and Chabris 1999). Subjects observing the film sequence are required to count the number of passes between the three players dressed in white, and whilst many subjects correctly count the number of passes, the majority fail to notice a large gorilla (an actor dressed as a gorilla) that walks into the middle of the visual field and beats its chest before walking casually out of the scene. People typically do not see this salient gorilla in the action sequence because their attention has been directed to the team in white by the instruction to count the number of passes. Why do we miss an object as salient as a gorilla, and what does this mean for our understanding of how different subjects might view complex information in real life, or in presentations that encapsulate aspects of real life, such as movies?

In this article we take an interdisciplinary approach to the question of how we might see certain things in complex dynamic environments. We draw together insights from the neurosciences and eye tracking studies, with anthropological understandings of vision and audio-visual media in order to map out an approach to audience research that accounts for the relationship between human perception, vision as a form of practical activity, and the environments through which these are co-constituted. We first build a brief outline of how the eye, visual perception and the subjectivity or selectivity of viewing are currently understood from the perspective of vision sciences. This demonstrates how physiologically there is evidence that the eye sees selectively, yet it does not fully explain why or how perceptual understanding might vary across different persons, or for the same person across different contexts. We then build on this understanding with a discussion of what we may learn from eye tracking studies with moving images. As we will show, eye tracking can offer detailed measurements of how the eye attends to specific instances, movements, and points within sequences of action. This can reveal patterns of attention across a sample of participants, towards specific types of action. Yet eye tracking is limited in that while it can tell us what participants’ eyes are attending to, it cannot easily tell us why, what they are experiencing, what their affective states are, nor how their actions are shaped by the wider social, material, sensory and atmospheric environments of which they are part. Therefore in the subsequent section we turn to phenomenological anthropology, and draw on the possibilities provided by the theoretical-ethnographic dialogue that is at the core of anthropological research, to suggest how the propositions of eye tracking studies might be situated in relation to the ongoingness and movement of complex environments.

We argue that such an interdisciplinary approach, which brings together monitoring and measurement with qualitative and experiential research, is needed to generate understandings not only of what people view but of how these viewing practices and experiences become relevant as part of the ways in which they both perceive and participate in the making of everyday worlds. However, we end the article with a note about the relative complexities of working across disciplines, and in particular between those that measure and those that use empathetic and collaborative modes of understanding and knowing, which can be theorised as part of the ways film is experienced (Bordwell and Thompson 2010; Pink 2013). For a review of how these issues relate to broader questions about film culture, eye tracking and the moving image, readers are also referred to the articles in this special issue by Redmond and Batty (2015), and Smith (2015).

Visual Resolution, Perception and the Human Eye

To enable visual perception the human eye has cone photoreceptors distributed across the retina that enable wide-field binocular visual perception of about 180 degrees (Leigh and Zee 2006). In the central foveal region of the eye, cone photoreceptors are much more densely packed, and our resulting high-acuity vision covers only about 2-3 degrees of visual angle (Leigh and Zee 2006). Visual angle is a convenient way to understand the relationship between the actual size of an object and viewing distance: for example, our foveal acuity is approximately equivalent to the width of our thumb held at about 57 cm (at this distance 1 cm represents 1 degree of visual angle). This means that to view visual information in detail it is often necessary to direct our gaze to different parts of a scene, and this is typically done with either ballistic eye movements termed saccades, or much slower smooth pursuit eye movements, as when we follow a slowly moving object in the distance (Martinez-Conde et al. 2004). Saccades are commonly broken down into two main types that are of high value for interpreting how viewers might perceive their environment: reflexive saccades, mainly thought to be driven by image salience (also termed exogenous control), and volitional saccades (endogenous control), where a viewer’s internal decision making directs attention through top-down mechanisms to where gaze should be deployed within a scene or movie sequence (Martinez-Conde et al. 2004; Parkhurst et al. 2002; Tatler et al. 2014; Pashler 1998; Smith 2013). Thus eye movements can be, in very broad terms, described as ‘bottom-up’ when the eye makes reflexive saccades to salient stimuli within a scene, or ‘top-down’ when a viewer uses volitional control to direct where the eye should look, and both types of saccade are important for understanding how we interact with complex scenes in everyday life. For example, on entering a café we might casually gaze at the wonderful variety of cakes with reflexive saccades to all the highly colourful icings; but when a friend says to ‘try the chocolate cake’ we direct our eyes only to cakes of chocolate-brown colour using volitional saccades. Interestingly, these different types of saccadic eye movement are likely to involve different cortical processing of information (Martinez-Conde et al. 2004), potentially allowing for complex multimodal processing that incorporates the rich and dynamic environment experienced when viewing a movie. It is likely that both these mechanisms operate whilst subjects view a film, and the extent to which each mechanism dominates during a particular film sequence may depend upon factors like visual design, narrative, audio input and cinematographic style, as well as the individual experience or demographic profile of observers.
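The visual-angle arithmetic above (at a 57 cm viewing distance, 1 cm on the screen subtends roughly 1 degree) follows from simple trigonometry. As an illustrative sketch only – the function name and values are ours, not drawn from any eye tracking package:

```python
import math

def visual_angle_deg(size_cm: float, distance_cm: float) -> float:
    """Visual angle (degrees) subtended by an object of a given
    physical size viewed at a given distance."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))

# At 57 cm, 1 cm subtends very close to 1 degree of visual angle.
print(round(visual_angle_deg(1, 57), 3))   # 1.005

# A roughly 2 cm thumb at arm's length covers about the 2-3 degree fovea.
print(round(visual_angle_deg(2, 57), 2))   # 2.01
```

The 57 cm figure is convenient precisely because the small-angle approximation makes degrees and centimetres nearly interchangeable at that distance.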

The fact that we typically only perceive the world in low resolution at any one point in time can be easily illustrated with an eye chart in which letters in different parts of our visual field are scaled to be equally legible when a subject fixates on a central spot, or simulated by selectively Gaussian-blurring a photograph so that it matches how we resolve detail at any one point in time (Figure 1). Human subjects typically shift their gaze about three times a second in many real-world scenarios in order to build up a detailed representation of the visual environment (Martinez-Conde et al. 2004; Tatler et al. 2014; Yarbus 1967). To efficiently direct the fovea to different parts of a visual scene, the human eye usually makes saccades, which also require a shift of the observer’s attention (Kustov et al. 1996; Martinez-Conde et al. 2004). One way to record subject gaze is to use a video-based eye tracking system that makes use of the different reflective properties of the eye to infrared radiation (Duchowski 2003), using a wavelength of radiation that is both invisible to the test subject and does not damage the eye. This non-invasive technique enables very natural behavioural responses to be collected from a wide range of subjects. When the eye is illuminated by infrared light, typically provided by the eye tracking equipment, the light enters the lens and is strongly reflected back by the retina, providing a high-contrast signal for an infrared camera to record, whilst some of the carefully placed infrared lights also reflect off the cornea, which provides a constant reference signal that enables eye tracker software to disentangle minor head movements from the actual eye movements of a subject. A subject is first calibrated to a grid stimulus of known spatial dimensions (Dyer et al. 2006), and then when test images are viewed it is possible to accurately quantify the different regions of a scene to which the subject pays attention, the sequence order of this attention, and thus also what features of a scene may escape the direct visual attention of a viewer (Duchowski 2003). This technique directly enables the measurement of subject attention to the different components of a stimulus (Figure 2), and has been extensively employed for static images in many fields including medicine, forensics, face processing, advertising, sport and perceptual learning (Dyer et al. 2006; Horsely 2014; Russo et al. 2003; Tatler 2014; Vassallo et al. 2009; Yarbus 1967).
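The fixations and saccades described here are usually extracted from the raw stream of gaze samples by a dispersion-based algorithm: samples that stay within a small spatial window for long enough count as one fixation. The following is a minimal sketch of that standard dispersion-threshold idea; the thresholds, data shapes and function name are illustrative assumptions, not the specific algorithm used by the eye tracker in the study reported here:

```python
def detect_fixations(samples, max_dispersion=30.0, min_samples=5):
    """Minimal dispersion-threshold fixation detection.

    samples: list of (x, y) gaze points in pixels at a fixed sampling rate.
    A run of points counts as a fixation while its bounding-box dispersion
    (width + height) stays under max_dispersion pixels and it lasts at
    least min_samples samples. Returns (centroid_x, centroid_y, n_samples).
    """
    fixations, i = [], 0
    while i < len(samples):
        j = i + min_samples
        if j > len(samples):
            break
        xs, ys = zip(*samples[i:j])
        if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
            i += 1  # window too spread out: shift the start forward
            continue
        # Grow the window while dispersion stays below threshold.
        while j < len(samples):
            xs, ys = zip(*samples[i:j + 1])
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                break
            j += 1
        xs, ys = zip(*samples[i:j])
        fixations.append((sum(xs) / len(xs), sum(ys) / len(ys), j - i))
        i = j
    return fixations

# Two steady gaze positions joined by a saccade yield two fixations.
gaze = [(100, 100)] * 10 + [(400, 300)] * 10
print(len(detect_fixations(gaze)))  # 2
```

Everything between the fixations (the large jump) is treated as saccadic and discarded, which is why analyses of movies work with fixation lists rather than raw samples.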

Figure 1. The way our eye samples the world means that only the central foveal region is viewed in detail. The left hand image shows letters scaled to equal legibility when a subject fixates gaze on the central dot, and the right hand image is a photographic reconstruction of how an eye would typically resolve detail of the Sydney Harbour Bridge at one point in time.


In recent times there has been a growing appreciation that to understand how the human visual system and brain process complex information, the use of moving images has significant advantages, since these stimuli may more accurately represent the very complex and dynamic visual environments in which we typically operate (Tatler et al. 2011). For example, when the eyes of a subject are tracked whilst driving a car, the gaze of subjects tends to be directed ahead of the responding action that the driver will take (Land and Lee 1994), and in other real-life activities like making a cup of tea, test subjects also tend to fixate on particular objects before an action like picking up an object (Land et al. 1999). This shows that visual processing is often dynamic and may be influenced by the top-down volitional goals of a subject, whilst static images may not always best represent how subjects’ actions are informed by visual input in a dynamic situation (Tatler 2014). Interestingly, the capacity of subjects to visually anticipate tasks may be linked to performance or experience at a given action: elite cricket batsmen viewing an action sequence can more efficiently predict where a ball will bounce in advance of the event, providing significant advantages for facing fast bowling, where decisions must be made very quickly and accurately (Land and McLeod 2000). Thus there is evidence that visual perception and eye movements for moving images may be influenced by top-down mechanisms and experience, as well as bottom-up, salience-driven mechanisms of visual processing (Tatler 2014).

Subject gaze and attention in dynamic environments can also be significantly influenced by the actions of other people within a scene. For example, when viewing a simple magic trick in which an experienced magician waves a hand to make an object disappear, the gaze direction of subjects viewing the video is heavily influenced by the actual gaze direction of the magician in the clip (Tatler and Kuhn 2007). If the magician appears to pay attention to his waving hand then subjects follow this misdirection, the trick performed with the other hand goes undetected, and the magic succeeds. However, this pattern changes if the magician’s gaze attends to the hand performing the apparent magical act, in which case the trick is readily detected by observers. This simple but highly effective demonstration shows that viewer experience is not only driven by reflexive bottom-up salience signals present in complex images; several top-down and/or contextual factors may also influence visual behaviour. The effect of dynamic complex environments on subject eye movements has also been observed in demonstrations of how people encountering each other either divert or direct their gaze depending upon prior experience, the perception of threat and/or the chance of a collision (Jovancevic-Misic and Hayhoe 2009). Other evidence of top-down influences on observer gaze behaviour comes from our understanding of how instructions or narrative may influence where a subject looks (Land and Tatler 2009; Tatler 2014; Yarbus 1967). For example, in the classic eye movement experiments by Yarbus (1967), in which static images were presented to test subjects, a variety of different instructions were provided for viewing the painting ‘The Unexpected Visitor’ by Ilya Repin. These instructions included estimating the material circumstances of the people within the painting, their ages, or their clothing; a very different set of saccades and fixations was observed for each instruction compared with a free-viewing condition that might be taken as mainly driven by bottom-up salience factors, showing that top-down viewing goals strongly influence the way in which gaze is directed (Tatler 2014; Yarbus 1967).

Eye Tracking For Understanding Dynamic and Complex Visual Information

Whilst these clever, and comparatively complex, evaluations of visual perception are teaching us a lot about human visual performance and viewer experience, rapid advances in computer technology and eye tracking are now starting to enable the testing of how subjects view very complex dynamic environments as encapsulated in movies (Mital et al. 2011; Smith and Henderson 2008; Smith and Mital 2013; Smith et al. 2012; Treuting 2006; Vig et al. 2009). This potentially allows new insights into increasingly real-world viewer experience, how the visual system processes very complex information, and how viewers from different demographics may interpret the information content of films. For example, some recent work has looked at viewer attention within movies and observed high levels of attention to faces (Treuting 2006), revealing behaviours consistent with previous work that used static images (Vassallo et al. 2009; Yarbus 1967), and a wealth of opportunities is becoming available for better understanding real-world visual processing.

Figure 2. When we view an image our eyes often fixate on key areas of interest for short periods of about a third of a second, and then the eyes may make ballistic shifts (saccades) to other features. When a typical subject viewed sequential images from the film ‘Up’, fixations (green circles) mainly centred on the respective faces of the main characters, whilst lines between fixations show the direction of respective saccades [image from Craig Batty, Adrian Dyer, Claire Perkins and Jodi Sita. 2014. Seeing Animated Worlds: Eye Tracking and the Spectator’s Experience of Narrative (Bloomsbury, 2015) with permission].


A current issue in interpreting eye movement data for subjects viewing a film is how such a large volume of data can be managed and statistically separated to interpret viewer experience. One initial solution is gaze plot analysis, which shows the collective attention of a number of subjects to a particular scene (Fig. 3). Investigations on still images using gaze plot analyses have indicated a strong central bias that is largely independent of factors like subject matter or composition (Tatler 2007). Studies on moving images appear to confirm a tendency to view restricted parts of the overall image in detail (Dorr et al. 2010; Goldstein et al. 2007; Mital et al. 2011; Smith and Henderson 2008; Tosi et al. 1997). This may hold important implications for data-compression algorithms, where large amounts of image data are streamed to a variety of mobile viewing devices and certain information need not be displayed at high resolution given the resolution of the human eye (Fig. 1), or where certain parts of a movie might be modified to enhance the viewing experience for visually impaired viewers (Goldstein et al. 2007). Despite the qualitative value of gaze plot displays, quantitative analyses are better facilitated by allocating Areas of Interest (Fig. 4) to components of a scene that are hypothesised to be of high value for dissecting different theories about the processing of moving images. For example, one of the current issues in understanding how eye tracking can inform film culture, and how movies can be a useful stimulus for understanding visual behaviour, is having a method that can explore the potential effects of narrative – a hypothesised top-down or endogenous influence on gaze – while subjects freely view a movie so as to enable natural behaviour (Smith 2013).
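At its simplest, the Area of Interest analysis described above reduces to summing fixation durations inside labelled screen regions. A minimal sketch follows; the AOI names and coordinates are invented for illustration (the article does not specify the analysis software), and real packages additionally handle AOIs that move between frames:

```python
# Hypothetical AOIs as pixel rectangles (x0, y0, x1, y1) for one scene.
AOIS = {
    "carl_face": (120, 80, 260, 220),
    "ellie_face": (380, 90, 520, 230),
}

def dwell_times(fixations, aois):
    """fixations: list of (x, y, duration_ms) tuples.
    Returns total dwell time in ms accumulated inside each AOI."""
    totals = {name: 0.0 for name in aois}
    for x, y, dur in fixations:
        for name, (x0, y0, x1, y1) in aois.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                totals[name] += dur
    return totals

# Three fixations: one on each face, one on the background.
fixes = [(150, 100, 300), (400, 120, 250), (50, 50, 200)]
print(dwell_times(fixes, AOIS))  # {'carl_face': 300.0, 'ellie_face': 250.0}
```

Dividing each total by the scene duration gives the percentage-of-viewing-time measures used to compare attention to the two characters in the ‘Up’ study.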

FIGURE 3: Gaze plot showing the mean attention of a number of viewers (n=12) to a particular scene. In this case faces capture most attention, consistent with previous reports (Yarbus 1967; Vassallo et al. 2009) [image from Craig Batty, Adrian Dyer, Claire Perkins and Jodi Sita. 2014. Seeing Animated Worlds: Eye Tracking and the Spectator’s Experience of Narrative (Bloomsbury, 2015) with permission].


Recently one study tackled this question by using a montage sequence from the animated movie ‘Up’ (Pete Docter, 2009) to explore whether it is possible to collect empirical evidence that supports modulation between bottom-up and top-down mechanisms. The animation montage is a high-value study case as it encapsulates a lifetime of narrative within a 262s film sequence that contains no dialogue (Batty 2011), and the overall salience of the two principal characters, ‘Carl’ and ‘Ellie’, is fairly consistently matched thanks to the control afforded by animation production. For example, in the initial opening scene in which these two characters are first encountered, viewers devoted an almost identical percentage of viewing time to Carl and Ellie respectively; however, as the montage unfolds with a life-story narrative of marriage, dreams of children, miscarriage, dreams of travel, illness and death, there is a significant difference in the amount of attention paid to the respective characters at different stages of the montage (Batty et al. 2014). This suggests that the influence of top-down processing on the overall salience of complex images, as observed in some studies using short motion displays in laboratory conditions (Jovancevic-Misic and Hayhoe 2009; Tatler and Kuhn 2007; Tatler 2014), is a promising avenue of investigation for movie studies, provided protocols can be designed to control the many factors that influence image salience (Parkhurst et al. 2002; Martinez-Conde et al. 2004; Tatler et al. 2014).

FIGURE 4. Areas of interest can be programmed to quantify the number and respective duration of fixations to key components within a scene of a movie, which may allow for the dissection of how factors like narrative influence viewer behaviour. [from Craig Batty, Adrian Dyer, Claire Perkins and Jodi Sita. 2014. Seeing Animated Worlds: Eye Tracking and the Spectator’s Experience of Narrative (Bloomsbury, 2015) with permission].

Yet because eye tracking can only tell us part of the story – that is, what people look at, and not how and why these ways of looking emerge and are enacted – other qualitative research approaches, such as those used in visual and sensory ethnography (Pink 2013; 2015), are needed to put eye tracking data into context. This involves approaching viewing, and the practices of vision that it entails, as situated activities, and as part of a broader experiential repertoire beyond the eye. The subjectivity and selectivity of viewing that the studies outlined above have evidenced, once documented and measured, can only be properly understood as emergent from particular (and always complex) environmental conditions and embodied experiences. In the next section we therefore turn to anthropological approaches to vision and the environment in order to show how this might be achieved. However, before proceeding, we note that when working across disciplines there is inevitably a certain amount of conceptual slippage. Here this means that whereas we ended the previous paragraph by suggesting that eye tracking enables our understanding of how complex environmental information is processed, in the next sections we refigure this way of thinking to consider how human perception and viewer experience are constituted in relation to the affordances of complex environments of which they are also part.

Situating Viewing as Part of Complex Environments

The environment, as a concept, is slippery and is used to different empirical and political ends in different contexts. As the anthropologist Tim Ingold has emphasised, in contemporary discourses ‘the environment’ is often referred to as an entity, as something that we exist separately from. Indeed, this idea is present in our discussion above, whereby we have considered how eye tracking studies might better show us how we process complex environmental information. As Ingold expresses it, this means we are ‘inclined to forget that the environment is, in the first place, a world we live in, and not a world we look at’. He argues that ‘We inhabit our environment: we are part of it; and through this practice of habitation it becomes part of us too’ (Ingold 2011). Following this approach, the environment can be understood as an ecology that humans are part of, and with which we, and the ways we view, see and experience, are mutually constituted. This does not simply mean that ‘we’ as humans are encompassed by the environment; it means that the environment is co-constituted by us and our relationships with other constituents, which for our purposes in this article we would emphasise include film, images, art, technologies, other humans, the weather and the built environment (as well as much more). As advanced by Ingold (2000, 2010) and the art historian Barbara Stafford (2006), approaches that critique linguistic and semiotic studies invite an analysis which acknowledges that – as Stafford puts it – ‘when you open your eyes and actively interrogate the visual scene, what you see is that aspect, or the physical fragments, of the environment that you perform’ (Stafford 2006). This also means that the experience of film does not simply involve us looking at something that is external to us; rather, it is through the affordances of film, in relation to the other constituents of our environments/worlds, that viewing becomes meaningful to us.
In this interpretation the use of ‘we’ derives from the development of a universal theory of human perception and our relationship to a (complex) environment. Yet, as we explain in the next section this rendering does not dismiss the idea that different people may often perceive the same information differently, and indeed to the contrary invites us to study precisely how and why difference emerges.

If we take Ingold’s approach further, to focus in on how meanings are generated through our engagements with and experiences of visual images, we can gain an appreciation of how the measurements and monitoring patterns that emerge from eye tracking studies are materialisations or representations not just of how the eye (or the mind) responds to the moving image. Rather, they can be understood as standing for (but not actually explaining the meaning of) what people do with the moving image. Building on philosophical and other traditions emerging from the work of Merleau-Ponty, Gibson and Jonas, Ingold has argued that human perception, learning and knowing emerge from movement, specifically as we move through our environments and engage with the affordances of the other things and processes we encounter (Ingold 2000). With regard to art, he has used this approach to suggest that therefore …

Should the drawing or painting be understood as a final image to be inspected and interpreted, as is conventional in studies of visual culture, or should we rather think of it as a node in a matrix of trails to be followed by observant eyes? Are drawings or paintings of things in the world, or are they like things in the world, in the sense that we have to find our ways through and among them, inhabiting them as we do the world itself? (Ingold 2010; p16)

If we transfer this idea to the question of how we view film, we might then ask how we, as viewers, inhabit film, and what eye tracking studies can tell us about these forms of habitation. If we consider the relationship established between the viewer’s eyes and the film by eye tracking visualisations, such as those demonstrated in the earlier sections of this article, we can begin to think of how the movement of the eye and the movement of the film become entangled. Indeed, while the film and the eye will both inevitably continue to move, the question becomes not simply how the composition and action on the screen influence the movement of the eye, but rather how the eye selects the aspects of the composition and action of the screen with which to move. By taking this perspective, we are able to remove something of the technological determinism that underpins assumptions that eye tracking studies might enable film and advertising organisations to better influence viewing behaviours. Instead, it directs us towards considering what eye tracking studies might tell us about what people do when they view, and how this can inform us about how they inhabit a world in which film, and the moving image more generally, is a ubiquitous presence.

The work and arguments discussed thus far in this section have focused on interpreting the question of how, at a general level, people see when they are viewing moving images. The theories advanced so far, however, neither explain nor discuss the usefulness of attending to the patterning of eye tracking studies. Moreover, the examples and visualisations of eye tracking studies we have shown in the earlier sections of this article were undertaken with a sample of people who were likely to have similar viewing perspectives, and, as might therefore be expected, showed distinct patterns in the ways that people view particular information. Indeed, the data that would be needed to tell us to what extent such viewing patterns are universal – that is, supported by studies and theories of the ways in which the human brain processes information – and to what extent they are situationally and biographically constituted for this particular group of participants does not, as far as we know, yet exist. Such work would be of high value given the increasing globalisation of both entertainment industries and forms of activism that use visual media, where films may be distributed in markets distant from the original context in which audience experience is understood. Indeed, studies of how people learn to look and know, undertaken in culturally specific contexts, reveal that where we look and what we see is contingent on processes of learning and apprenticeship, and therefore specific to complex environments.

Vision, Learning and Knowing

Eye tracking studies have shown us that there are sometimes similarities and patterns in the ways people view and remember complex images (Norton and Stark 1971), although, where present, such patterns are easily changed through instruction (Yarbus 1967; Tatler 2014). We have seen in the earlier sections of this manuscript that participants in studies have consistently fixed their gaze on the faces of film characters (Figs 2, 3), and that visual attention may become focused on a film character whose story line commands (or affords) particularly powerful affective and/or empathetic connections for viewers. Further eye tracking research would be needed to underpin any proposal that such ways of viewing are both gendered and culturally specific; however, existing research in visual and media anthropology indicates that this is likely to be the case. Two bodies of literature are relevant here: first, the applied visual and media anthropology literature, and second, the anthropology of vision.

Applied visual and media anthropology studies (Pink 2007) focus on using anthropological understandings of media, along with audiovisual interventions (often in the form of filmmaking processes and film products), to work towards new forms of social and public awareness, and societal change. This work draws on and advances a strand in film studies developed in the work of Laura Marks, who has advanced the idea of the ‘embodied viewing experience’ (2000: 211). Marks, whose work focuses on intercultural cinema, has argued that as ‘a mimetic medium’ cinema is ‘capable of drawing us into sensory participation with its world’ (Marks 2000: 214). The notion of empathy as a route towards creating intercultural understanding through film is also increasingly popular in the visual anthropology literature (discussed in Pink 2015). While on the whole there has been insufficient research into the ways in which people view intervention films of this kind, one example that has been undertaken suggests that viewer attention, and importantly viewers’ capacity to engage with and remember film narrative, can depend on the ways in which they are able to affectively or empathetically engage with the experiences of film characters. Susan Levine’s media anthropology study of how viewers discussed a film made as part of a South African HIV/AIDS intervention campaign, which drew on local narratives to communicate its central message, is a good example (Levine 2007). Levine (unsurprisingly) found that participants engaged with the stories of film characters that followed locally relevant narratives, thus generating important lessons for filmmaking campaigns of this kind, where it is often difficult to communicate generic health messages to local audiences. The bridge between this type of anthropological understanding and a capacity to map viewer attention to faces and expressions within visual representations (Vassallo et al. 2009) may allow for more comprehensive understandings of why film is such a powerful medium for communication.

Anthropological studies of vision provide further evidence of the importance of attending to how seeing is situated. Indeed, when vision is understood as a practice, rather than as a behaviour, it is not just a situated practice but a practice that is learned through participation. The anthropologist Cristina Grasseni has developed a theory of what she calls ‘skilled vision’ through which to explain this (Grasseni 2004, 2007, 2011). As she puts it:

The “skilled visions” approach considers vision as a social activity, a proactive engagement with the world, a realm of expertise that depends heavily on trained perception and on a structured environment (Grasseni 2011).

Emphasizing that skilled visions are ‘positional, political and relational’ as well as sensuous and corporeal, Grasseni points out that ‘Because skilled visions combine aspects of embodiment (as an educated capacity for selective perception) and of apprenticeship, they are both ecological and ideological, in the sense that they inform worldviews and practice’ (Grasseni 2011). As Pink has shown through her work on the Spanish bullfight, what one sees when viewing the performance is highly contingent on how one has learned to view, on one’s own empathetic, embodied ways of sensorially and affectively ‘feeling’ the performance at which a visual representation was created, and on how one’s existing ways of knowing and understanding the world can inform perception (Pink 1997; 2011). For example, consider the different ways in which Figure 5, or a film sequence of the same performance, would be interpreted by a bullfighting fan and an animal rights activist. Each will have learned how and what to know about this performance through different trajectories. Whilst an eye tracking investigation of the respective subjects might show somewhat similar patterns (especially if bottom-up mechanisms dominate), the semantic interpretation of the visual input by the respective viewers may be completely different. How such information content might be assessable, or not, through evaluation of the bottom-up or top-down mechanisms involved in visual processing will be a major challenge for the interpretation of information as complex as that typically perceived in a movie.


Figure 5. How emotive content, as is common in many films, may influence the perception of visual images even when the same information is presented to viewers remains a major topic for exploration. For example, we know that the bullfight is interpreted, and affectively experienced, very differently when viewed by bullfight fans and animal rights activists. We also know that learning how to view the bullfight, as a bullfight fan, is a process of cultural apprenticeship (see for example Pink 1997). Consider how, for the above image, the action of a bullfight could promote very different visual behaviour depending upon cultural context, whether a subject was a bullfighting fan or an animal rights activist, whether the representation was depicted as animation instead of real life, or whether it was in motion compared to a still image. Copyright: Sarah Pink.

Bringing together measurement and monitoring data with anthropologically informed ethnographic ways of knowing, which are always collaboratively crafted and sensorially and tacitly known, is increasingly common. For instance, in energy research a number of projects seek to combine ethnographic and energy consumption measurement data (Cosar Jorda et al. 2013). Such an approach has not yet been integrated into eye tracking studies of movies, yet this would be the next step if we want to better understand the significance and relevance, for understanding film audiences, of the types of data and knowledge that eye tracking studies can offer us. This, however, presents certain challenges, which impinge on, but are not necessarily unique to, the use of eye tracking data in audience research. The first challenge is to generate sufficient interdisciplinary understanding between the approaches involved. This article has intended to initiate that process. That is, it has explained how eye tracking and anthropological-ethnographic (that is, at once theoretical and practical) approaches offer different, and differently theorised, perspectives on the ways in which people look at and participate in the viewing of film. It has simultaneously suggested, however, that these different approaches and disciplines offer something to each other that enables new questions to be asked, and therefore deeper understandings of how audiences view film to be developed.

Future work testing human visual behaviour with complex stimuli of the kind typically present in movies may help build our understanding of how humans sometimes process very complex information to construct an understanding of the surrounding world, yet sometimes miss salient information in complex moving images, as in the ‘Gorillas in our midst’ study. Current theories suggest that perceptual blindness to salient and recognisable stimuli occurs when our attention is captured by other competing stimuli that impose a cognitive load to process (Simons and Chabris 1999; Levin et al. 2000; Memmert 2006), but more fully exploring the effects of narrative or instructions, character gaze and other potential top-down mechanisms is likely to contribute fruitfully to our knowledge of perceptual blindness. Indeed, as discussed above in relation to anthropological and ethnographic factors, factors like experience do appear to modulate the ability of subjects to detect a gorilla in a perceptual blindness test (Memmert 2006), suggesting that future investigations of eye tracking and movies should consider the broad range of human experience that can influence our perception. This type of research is also likely to provide richer understandings in some ethnographic studies, as researchers will have, possibly for the first time, access to precise quantitative data on whether an observer actually failed even to look at certain objects in a scene, or whether such information, like an unexpected gorilla in a basketball game, was viewed but not consciously perceived (Memmert 2006). Many individual shots within a film are short, typically about 4 s in duration, so it is often only possible for viewers to process a small percentage of the entire visual presentation in detail, especially when movies are subtitled (Smith 2013).
This means that elements of a film that might be essential to complete comprehension of the narrative story line may easily be missed by a proportion of an audience, depending upon their individual knowledge base, linguistic skills, attention and motivation. Eye tracking thus potentially offers filmmakers a useful vehicle for testing different demographic groups, to better understand how different components of scenes might be constructed to enhance viewer experience, and also to build our understanding of how we process very complex environmental information.
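The limited detail a viewer can sample from a short shot can be illustrated with a back-of-envelope calculation. This is a rough sketch under stated assumptions: the 4 s shot length comes from the text, while the fixation duration, foveal size and screen extent are typical illustrative figures, not measurements from any cited study.

```python
# Rough estimate: how many fixations fit into a ~4 s shot, and what
# fraction of the frame the fovea can sample in that time.
# All constants below are assumed illustrative values.
import math

SHOT_S = 4.0                # shot duration cited in the text
FIXATION_S = 0.33           # assumed average fixation duration (~330 ms)
FOVEA_DEG = 2.0             # assumed foveal diameter (degrees of visual angle)
SCREEN_DEG = (40.0, 22.5)   # assumed screen extent in degrees at viewing distance

n_fix = SHOT_S / FIXATION_S                      # fixations per shot
fovea_area = math.pi * (FOVEA_DEG / 2) ** 2      # area foveated per fixation
screen_area = SCREEN_DEG[0] * SCREEN_DEG[1]      # total frame area
coverage = n_fix * fovea_area / screen_area      # upper bound, ignoring overlap

print(f"~{n_fix:.0f} fixations per shot, foveating at most "
      f"{coverage:.0%} of the frame (ignoring overlap)")
```

Under these assumptions the fovea can sample only a few percent of the frame per shot, which makes concrete why attention must be highly selective and why essential narrative details can be missed.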



Acknowledgements. We are very grateful to Dr Craig Batty, Dr Claire Perkins and Dr Jodi Sita for discussions and permission to use images from their collaborative work with one of us (AGD), and for broader discussions with members of the Eye Tracking of the Moving Image research group. AGD acknowledges funding support from the Australian Research Council (LE130100112) for eye tracking equipment. We are grateful to Dr Lalina Muir for her careful proofreading of the manuscript.



Batty, Craig. 2011. Movies That Move Us: Screenwriting and the Power of the Protagonist’s Journey. Basingstoke: Palgrave Macmillan.

Batty, Craig, Dyer, Adrian G., Perkins, Claire, and Sita, Jodi. 2015. Seeing Animated Worlds: Eye Tracking and the Spectator’s Experience of Narrative. Basingstoke: Palgrave Macmillan, forthcoming.

Cosar Jorda, P, Buswell, RA, Webb, LH, Leder Mackley, K, Morosanu, R, and Pink, Sarah. 2013. ‘Energy in the home: Everyday life and the effect on time of use.’ In The Proceedings of the 13th International Conference on Building Simulation 2013. Chambery, France. 25-28/8/2013.

Docter, P. 2009. Up. Disney-Pixar Motion Film.

Dorr, M, Martinetz, T, Gegenfurtner, KR, and Barth, E. 2010. ‘Variability of eye movements when viewing dynamic natural scenes.’ Journal of Vision 10 (28): 1-17.

Duchowski, Andrew. 2003. Eye tracking methodology: theory and practice. London: Springer-Verlag.

Dyer, Adrian, G., Found, Brian, and Rogers, Doug. 2006. ‘Visual attention and expertise for forensic signature analysis.’ Journal of Forensic Science 51: 1397–1404.

Goldstein, Robert, B., Woods, Russell,L., and Peli, Eli. 2007. ‘Where people look when watching movies: Do all viewers look at the same place?’ Computers in Biology and Medicine 37 (7): 957-964.

Grasseni, Cristina. 2004. ‘Video and ethnographic knowledge: skilled vision in the practice of breeding.’ In Working Images, edited by S Pink, L Kürti, and AI Afonso, 259-288. London: Routledge.

Grasseni, Cristina. 2007. Skilled Visions. Oxford: Berghahn.

Grasseni, Cristina. 2011. ‘Skilled Visions: Toward an Ecology of Visual Inscriptions.’ In Made to be Seen: Perspectives on the History of Visual Anthropology, edited by M. Banks and J. Ruby. Chicago: University of Chicago Press.

Horsley, Mike. 2014. ‘Eye Tracking as a Research Method in Social and Marketing Applications.’ In Current Trends in Eye Tracking Research, edited by M. Horsley et al., 179-182. London: Springer.

Ingold, Tim. 2000. The Perception of the Environment. London: Routledge.

Ingold, Tim. 2010. ‘Ways of mind-walking: reading, writing, painting.’ Visual Studies, 25 (1): 15–23

Ingold, Tim. 2011. Being Alive. Oxford: Routledge. p 95.

Jovancevic-Misic, Jelena, and Hayhoe, Mary. 2009. ‘Adaptive Gaze Control in Natural Environments.’ Journal of Neuroscience 29 (19): 6234–6238. DOI:10.1523/JNEUROSCI.5570-08.2009.

Kustov, Alexander, A., and Robinson, David Lee. 1996. ‘Shared neural control of attentional shifts and eye movements.’ Nature 384: 74–77.

Levine, Susan. 2007. ‘Steps for the Future: HIV/AIDS, Media Activism and Applied Visual Anthropology in Southern Africa.’ In Visual Interventions, edited by S. Pink, 71-89. Oxford: Berghahn.

Marks, Laura. 2000. The Skin of the Film: Intercultural Cinema, Embodiment, and the Senses. Durham and London: Duke University Press

Martinez-Conde, Susana, Macknik, Stephen L., and Hubel, David H. 2004. ‘The role of fixational eye movements in visual perception.’ Nature Reviews Neuroscience 5: 229–240.

Memmert, Daniel. 2006. ‘The effects of eye movements, age, and expertise on inattentional blindness.’ Consciousness and Cognition 15 (3): 620–627.

Mital, Parag, K., Smith, Tim,J., Hill, Robin, L., and Henderson, John, M. 2011. ‘Clustering of gaze during dynamic scene viewing is predicted by motion.’ Cognitive Computation 3, 5–24.

Nodine, Calvin F., Mello-Thoms, Claudia, Kundel, Harold L., and Weinstein, Susan P. 2002. ‘Time course of perception and decision making during mammographic interpretation.’ American Journal of Roentgenology 179: 917–923.

Norton, David, and Stark, Lawrence. 1971. ‘Scanpaths in eye movements during pattern perception.’ Science 171: 308–311.

Parkhurst, Derrick, Law, Klinton, and Niebur, Ernst. 2002. ‘Modeling the role of salience in the allocation of overt visual attention.’ Vision Research 42: 107–123.

Pashler, Harold. 1998. Attention. Hove, UK: Psychology Press Ltd.

Russo, Francesco, Pitzalis, Sabrina, and Spinelli, Donatella. 2003. ‘Fixation stability and saccadic latency in elite shooters.’ Vision Research 43: 1837–1845.

Pink, Sarah. 1997. Women and Bullfighting. Oxford: Berghahn.

Pink, Sarah. 2007. (ed) Visual Interventions. Oxford: Berghahn.

Pink, Sarah. 2011. ‘From Embodiment to Emplacement: re-thinking bodies, senses and spatialities.’ In Sport, Education and Society (SES), special issue on New Directions, New Questions. Social Theory, Education and Embodiment 16(34): 343-355.

Pink, Sarah. 2013. Doing Visual Ethnography, 3rd edition. London: Sage.

Pink, Sarah. 2015 Doing Sensory Ethnography, 2nd edition London: Sage.

Simons, Daniel J., and Chabris, Christopher F. 1999. ‘Gorillas in our midst: sustained inattentional blindness for dynamic events.’ Perception 28 (9): 1059-1074.

Smith, Tim J., and Henderson, John M. 2008. ‘Edit blindness: The relationship between attention and global change blindness in dynamic scenes.’ Journal of Eye Movement Research 2: 1–17.

Smith, Tim, J. 2013. ‘Watching you watch movies: Using eye tracking to inform cognitive film theory.’ In Psychocinematics: Exploring Cognition at the Movies edited by A. P. Shimamura, 165-191. New York: Oxford University Press

Smith, Tim J., Levin, Daniel, and Cutting, James. 2012. ‘A Window on Reality: Perceiving Edited Moving Images.’ Current Directions in Psychological Science 21 (2): 107-113. doi:10.1177/0963721412437407.

Smith, Tim J., and Mital, Parag K. 2013. ‘Attentional synchrony and the influence of viewing task on gaze behaviour in static and dynamic scenes.’ Journal of Vision 13 (8): 16.

Stafford, Barbara Maria. 2006. Echo Objects: the Cognitive Work of Images. Chicago: University of Chicago Press.

Tatler, Ben W. 2007. ‘The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions.’ Journal of Vision 7 (14): 4, 1–17. doi:10.1167/7.14.4.

Tatler, Ben, W. 2014. ‘Eye Movements from Laboratory to Life.’ In Current Trends in Eye Tracking Research edited by Horsley et al., p17-35.

Tatler, Ben, W., and Kuhn, Gustav. 2007. ‘Don’t look now: The magic of misdirection.’ In Eye Movements: A window on mind and brain, edited by R van Gopel, M Fischer, W Murray and R Hill, 697–714. Amsterdam: Elsevier.

Tatler, Ben W., Hayhoe, Mary M., Land, Michael F., and Ballard, Dana H. 2011. ‘Eye guidance in natural vision: Reinterpreting salience.’ Journal of Vision 11 (5): 1–23. doi:10.1167/11.5.5.

Tatler, Ben W., Kirtley, Claire, Macdonald, Ross G., Mitchell, Katy M. A., and Savage, Steven W. 2014. ‘The Active Eye: Perspectives on Eye Movement Research.’ In Current Trends in Eye Tracking Research, 3-16. doi:10.1007/978-3-319-02868-2_16.

Tosi, Virgilio, Mecacci, Luciano, and Pasquali, Elio. 1997. ‘Scanning eye movements made when viewing film: Preliminary observations.’ International Journal of Neuroscience 92 (1/2): 47-52.

Treuting, Jennifer. 2006. ‘Eye tracking and cinema: A study of film theory and visual perception.’ Society of Motion Picture and Television Engineers 115 (1): 31-40.

Vassallo, Suzanne, Cooper, Sian, LC., and Douglas, Jacinta, M. 2009. ‘Visual scanning in the recognition of facial affect: Is there an observer sex difference?’ Journal of Vision 9: 1-10.

Vig, Eleonora, Dorr, Michael, and Barth, Erhardt. 2009.’ Efficient visual coding and the predictability of eye movements on natural movies.’ Spatial Vision 22 (2): 397-408.

Yarbus, Alfred L. 1967. Eye Movements and Vision. New York: Plenum.

Adrian Dyer is an Associate Professor in Media and Communication at RMIT University (Australia) investigating vision in complex environments. He is an Alexander von Humboldt Fellow (Germany) and a Queen Elizabeth II Fellow (Australia), and has completed postdoctoral positions at La Trobe University and Monash University (Australia), Cambridge University (UK), and Wuerzburg and Mainz Universities (Germany).

Sarah Pink is Professor of Design and Media Ethnography at RMIT University (Australia). She is visiting/guest Professor at Halmstad University (Sweden), Loughborough University (UK), and Free University Berlin (Germany). Her most recent books include Situating Everyday Life (2012), Doing Visual Ethnography 3rd edition (2013) and Doing Sensory Ethnography 2nd edition (2015).

Editorial – Seeing Into Things: Eye Tracking the Moving Image – Sean Redmond & Craig Batty

Seeing into Things

We chose Seeing into Things: Eye Tracking the Moving Image as the title of this special edition to foreground the importance of reaching beyond – and beneath – the surface of the screen and the worlds that it creates and envisions. Through the empirical data that eye tracking affords us we are able to evidence and account for the depth in perception and sensibility that accompanies or anchors viewing. Seeing into Things is also recognition of the layers – or epidermi – of technological vision: depth cues, focal length, camera movement, and the delicious qualities of mise-en-scene all invite or demand that the image is looked into. There is much to observe across the textures and texturality of the screen. Eye tracking technology sees into the eyes of the viewer who peers – pierces – into the immersive world of the screen, factual or fictional. We see beauty in this alignment between the eye tracker, the viewer, and the screen. As this special edition finds, Seeing into Things is an enriching and intoxicating way of (re)discovering the complexities of viewing the moving image.

The poetry in and of seeing is not simply experiential, but connected to neurological, anatomical and cognitive processes. It is also connected to culture, discourse and ideology, where seeing into things is always gendered, classed and raced, amongst other encultured practices and modes of being in the world. Seeing into Things enables us to see into ourselves and into the complex and sometimes messy relationships between biology and culture, the human and technology, and between eye, brain, body and ear.

This last carnal conjunction is essential to the work being undertaken in this special edition because Seeing into Things is also meant to critically draw attention to the ocularcentric way through which the world is presently imagined to be experienced. Our position here is not to support this insightful supremacy, but rather to offer challenges and counter-points to it. When we see into things in this special edition it is with the need to recognise the centrality of hearing to seeing; of touching to viewing; and of the incorporation of the full human sensorium as it is taken up and in, and extends itself towards, the screen worlds that move and affect it. As Vivian Sobchack (2000) observes:

As “lived bodies” (to use a phenomenological term that insists on “the” objective body as always also lived subjectively as “my” body, diacritically invested and active in making sense and meaning in and of the world), our vision is always already “fleshed out”–and even at the movies it is “in-formed” and given meaning by our other sensory means of access to the world: our capacity not only to hear, but also to touch, to smell, to taste, and always to proprioceptively feel our dimension and movement in the world. In sum, the film experience is meaningful not to the side of my body, but because of my body.

The eye, brain, body and ear conjunction is also recognition that in order to understand viewing processes, one needs to incorporate different academic disciplines and approaches; from the vision sciences, neuroscience and linguistics; from ethnography and anthropology; and from the arteries and veins of creative practice, to the orbital concerns of the phenomenological. To do our work properly, then, Seeing into Things requires the eyes, brains, bodies and ears of scientists, anatomists, anthropologists, musicologists, filmmakers, screenwriters, and screen theorists, amongst others. It is this exciting arts-science nexus that this special edition draws uniquely from and is built upon, offering a foundational intervention into the way one makes critical and creative sense of viewers’ engagement with the moving image.

But what are the origins of this interdisciplinary approach? From where did the impetus for Seeing into Things come? Let us now return to the origins of the formation of the research group that drives many of the articles in this edition. Let us set the cinematic mood for some groundbreaking eye tracking research.

In the Mood for Eye Tracking Research

After a screening of the film In the Mood for Love (Kar-wai, 2000), Sean Redmond, a film and television scholar, mentioned to neuroscientist and anatomist Jodi Sita how its rich colour scheme, expressionistic lighting and meandering narrative had fascinated and affected him. He suggested that he was sure his eyes were focusing on these visual elements as they were being foregrounded, but also that they ‘wandered’ about the screen, choosing to look at motifs, characters and textures of their own volition, and where ‘mood’ took them. Sean contended that his eyes, or the way he viewed film, were under the command of the film’s narrative and aesthetics, but also free to discover the opulent fictive world for themselves. He suggested that viewing is an embodied experience.

Jodi responded quite directly: how do you know this? What evidence do you have? She continued that perception and comprehension are cognitive processes, and that what the eyes attend to in any viewing context can be measured objectively and understood through eye tracking, as well as other physiological technologies such as the measurement of pupil dilation. In that moment an arts-science debate was ignited, and an idea for an empirically driven research group devoted to eye tracking the moving image was born. They were now in the mood for some landmark empirical research of the moving image.

Jodi and Sean set up the Melbourne-based Eye Tracking and the Moving Image Research group at the end of 2012. They had two central goals in bringing the group together: one, they wanted to utilise eye tracking technology more centrally in the analysis and examination of the moving image; and two, they wanted to draw together scholars and practitioners from the Sciences, and the (Creative) Arts and Humanities, so that different modes of enquiry, and theoretical and methodological apparatus, were placed in the same analytical arena (see Jodi’s account of the group’s formation in this edition). It was felt that having a room full of filmmakers, artists, film and cultural theorists, screenwriters, visual ethnographers, vision scientists and neuroscientists would generate new and exciting conversations and deliberations about how viewers engage with the moving image. To employ a games analogy, Jodi and Sean felt it was as if we had all pinned our tails to different parts of the donkey, but that through opening our eyes together, we would all finally get to see and comprehend its full and glorious anatomy.

Their desire was to build upon existing research that drew disparate disciplines together, extending the type of work being conducted in arts-science research centres such as the Department of Psychology, Neuroscience and Behaviour’s NeuroArts Lab at McMaster University, Hamilton, Ontario. The formation of the group created a strong commitment to inter-disciplinary and cross-institutional relationships, and to what was considered a necessary dialogue between different disciplines united by a shared desire: to investigate vision regimes in relation to the affecting power and beauty of the moving image.

The utilisation of eye tracking technology was thus not born out of technological determinism, but as a tool to bridge and fuse different approaches and methodologies in order that new findings, new knowledge, and new ways of understanding seeing and sensing images could emerge. This approach drew upon work by scholars who had already ‘crossed the line’, so-to-speak, including Uri Hasson, Ohad Landesman, Barbara Knappmeyer, Ignacio Vallines, Nava Rubin and David J. Heeger, who introduced to the field neurocinematics, the neuroscience of film, and the ‘inter-subject correlation analysis (ISC) … used to assess similarities in the spatiotemporal responses across viewers’ brains during movie watching’ (2008: 1). But you may ask: what is eye tracking?

But what is Eye Tracking?

Eye tracking enables us to empirically measure what viewers look at when watching screen-based media. The technology allows us to gather data from all platforms, interfaces and portals through which the moving image is distributed and consumed, including the television set, the cinema screen, the computer, and mobile devices such as smartphones and tablets. It also enables us to enter different types of environment to record viewing patterns, including the home and public spaces, such as the mall or the commute to work. Analysis of viewers’ engagement with the moving image includes assessing where they look; interpreting why and how they look within determined visual fields or Areas of Interest (AOIs); and exploring what they feel or experience when they look. One can employ eye trackers to analyse viewer engagement with elements such as narrative, cinematography, editing, aesthetics, sound design and score, and characterization – elements that feature in many of the articles in this special edition. To do so, however, requires not only recognition and understanding of the languages employed in telling moving image stories, but also engagement with the science of the eye and the physiological and cardiovascular transformations that take place when screen content is being viewed. To this end, a range of supportive investigative and methodological tools is also often employed, including the measurement of pupil dilation and the monitoring of heart and breathing rates.

Eye trackers work by shining infrared light onto the eye, which is then reflected back and captured by a sensor. The way we view images involves alternating between fixations, periods in which the eye is relatively still and the visual system gathers information, and saccades, the rapid movements that carry the eye between fixation points. The sensor allows these eye movements to be tracked, and specialist software then visualises them in the form of heat maps, swarms and gaze plot graphs. Statistical data can be extracted from these visualisations, and an interpretative framework can also be employed. For example, heat maps effectively show the weighting of all the viewing that occurred in a given scene, while gaze plots show the location of the fixations as well as the sequence in which they were made. To draw conclusions from this data, an area of interest analysis can be performed, in which the number of times viewers visited specific objects or areas is computed. By then analysing the amount of processing time spent in these areas, and the number of return visits made, researchers can build a picture of what was concentrated on. As can be seen from the articles in this special edition, analysis of this data draws us into open, and sometimes competing, exchanges about what has been discovered and why.
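The area of interest analysis described above can be illustrated with a short sketch. The code below is not drawn from any study in this edition; the fixation data, AOI coordinates and field names are hypothetical, and real eye tracking software (and studies such as those reported here) would use its own formats and metrics. It simply shows how fixation count, total dwell time and the number of visits (entries into an AOI, including return visits) can be computed from a sequence of fixations.

```python
def in_aoi(fix, aoi):
    """True if a fixation point falls inside a rectangular AOI."""
    return (aoi["x0"] <= fix["x"] <= aoi["x1"]
            and aoi["y0"] <= fix["y"] <= aoi["y1"])

def aoi_metrics(fixations, aoi):
    """Fixation count, total dwell time (ms) and visit count for one AOI."""
    count = 0
    dwell_ms = 0
    visits = 0
    inside = False          # were we in the AOI on the previous fixation?
    for fix in fixations:
        if in_aoi(fix, aoi):
            count += 1
            dwell_ms += fix["dur_ms"]
            if not inside:  # entering the AOI starts a new visit
                visits += 1
            inside = True
        else:
            inside = False
    return {"fixations": count, "dwell_ms": dwell_ms, "visits": visits}

# Hypothetical sequence: a viewer fixates a character's face (the AOI),
# glances away, then returns -- two visits in total.
face = {"x0": 400, "x1": 600, "y0": 100, "y1": 300}
fixations = [
    {"x": 450, "y": 150, "dur_ms": 220},  # on the face
    {"x": 500, "y": 200, "dur_ms": 180},  # still on the face
    {"x": 900, "y": 500, "dur_ms": 250},  # looks away
    {"x": 480, "y": 180, "dur_ms": 300},  # return visit
]
print(aoi_metrics(fixations, face))
# {'fixations': 3, 'dwell_ms': 700, 'visits': 2}
```

A heat map is essentially the same data aggregated spatially rather than per AOI: each fixation contributes its duration to the screen region around its coordinates.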

Double Dialogue

The articles in this edition are engaged in what we would like to define as a double dialogue. Each of the articles stands in their own right as discrete research, and yet they are also engaged in reflective and reflexive commentary. This dialoguing happens both within articles (see, for example, Redmond et al.) and also across articles (see, for example, Dyer and Pink, who draw upon the work of Batty et al.), to explore the possibilities and limitations of eye tracking research. The conversations that emerge enable the arts-science nexus to gather its power, since the different approaches to the text and their findings are foregrounded, drawn into syncretic union, or else are openly contested (see, for example, Brown and Smith’s engagement with each other’s work).

One can read this special edition, then, as the literal embodiment of the grounded work that takes place in a shared, respectful and mutually supportive interdisciplinary working environment. The virtues of the double dialogue approach to a special edition such as this are many, but most importantly one can see the value of the research on its own terms, and see how it has grown out of a dynamic research environment. We are able to witness directly how contributors have worked with and for each other, and how they are able to accommodate and enrich each other’s understandings of the texts under investigation. By seeing into things in this way, powerful research stories emerge.

The Stories of Seeing into Things

We have chosen to present the articles in this edition in a way that tells a research story, where conversations emerge and narrative arcs progress within and across the work presented. We have ordered them in a way that creates a narrative pattern; one can see ideas and themes introduced in one article picked up and developed in another. The story is also one that moves across screen media, from film to television and from features to serials. The special edition opens with a master shot of the field and closes with a tying up of the narrative threads that have been presented throughout. That is not to say, as previously noted, that each article does not stand as discrete research, but rather to recognise the beautiful truth of bringing overlapping and communicating research stories together in this way.

The stories of Seeing into Things are also about the research environment that has been cultivated through the work of the Eye Tracking and the Moving Image Research group, and in the process of putting this special edition together. New international research relationships have been fashioned; and new friends have been made. We find in the inter-disciplinary stories of this special edition a range of content, styles and approaches in a deliberate attempt to engage readers (other researchers and practitioners) in recognising the power of crossing the research line.

Adrian G. Dyer (a vision scientist) and Sarah Pink (a visual anthropologist and ethnographer) open the edition with a critical, holistic overview of eye tracking research in relation to the screen. In Movement, Attention and Movies: the Possibilities and Limitations of Eye Tracking?, Dyer and Pink suggest that film narrative and the conditions of viewing have a significant influence on gaze relations and subjectivity, but that there is as yet limited work on the complexities and variables of such connections and alignments. Drawing upon their own research fields of vision science, anthropology and ethnography, Dyer and Pink demonstrate the value and importance of inter-disciplinary scholarship to understanding the poetics and politics of viewing the moving image. To make their observations they draw upon research carried out by Craig Batty, Claire Perkins and Jodi Sita, whose article naturally follows in this edition.

In How We Came To Eye Tracking Animation: A Cross-Disciplinary Approach to Researching the Moving Image, Batty, Perkins and Sita draw upon their pilot study of eye tracking a time-lapse montage sequence from the film Up (a study that also included Dyer). In their article, they outline how the ‘research journey’ of their project took shape, and they discuss how each of them came to the study from their individual disciplines: screenwriting, screen studies and neuroscience. They suggest that their own discipline backgrounds initially influenced and shaped both their research methodology and also the analysis of the research findings. However, they then point towards the layering of these approaches, as a way to fully discover how the montage scene under analysis can be best understood. This inter-disciplinary approach is fully taken up in the next article.

In Sound and Sight: An Exploratory Look at Saving Private Ryan through the Eye Tracking Lens, Jenny Robinson, Jane Stadler and Andrea Rassell place emphasis on the connection between looking and hearing; or, seeing and sounding. By focusing on sonic aesthetics that, arguably, direct viewer attention as much as any other film aesthetic, they use a sound-on, sound-off methodology to test their hypothesis. The resulting discussion will be as useful to film practitioners as it is to screen and eye tracking scholars.

The question of utility, or practice, is taken up in Jan Louis Kruger, Agnieszka Szarkowska and Izabela Krejtz’s Subtitles on the Moving Image: An Overview of Eye Tracking Studies. They look towards new cognitive research horizons in the field of audiovisual translation (AVT). Seeing limitations and weaknesses in the current eye tracking research being conducted on subtitling, they argue that attention needs to be directed to the actual processing of verbal information. Drawing upon data gathered from numerous eye tracking studies, they contend that it demonstrates the way shot changes, language and subtitles impact upon cognitive processes, and how this has implications for subtitling and captioning.

Drawing on her doctoral research, Tessa Dwyer also explores subtitling, but in relation to the BBC television series Sherlock. Dwyer’s fascinating article focuses upon its use of post-production (though scripted) free-floating text. In From Subtitles to SMS: Eye-Tracking, Texting and Sherlock, Dwyer offers an in-depth analysis of viewer engagement with the show, exploring notions of reading vs. viewing, and attraction vs. distraction. Dwyer also draws upon ideas raised by Sean Redmond, Jodi Sita and Kim Vincs in their article, Our Sherlockian Eyes: the Surveillance of Vision.

In this article, Redmond, Sita and Vincs offer us a unique interior dialogue as they each read the eye tracking data gathered through their own discipline filters while also dialoguing with each other’s approaches. Each author sees the hands of direction, misdirection, movement, surveillance and relationality in the scene under analysis, with agreement that vision is never simply cognitive or anatomical but multi-modal and haptic. They employ eye tracking data in ways that recognise the phenomenological embedded in the viewing experience, and which can be ‘extracted’ from what are normally seen or interpreted as qualitative findings.

The final two articles in this special edition then engage in a different type of dialogue or debate. In Politicizing Eye-tracking Studies of Film, William Brown draws upon the (short) history of eye tracking and moving image research, and specifically the work of Tim J. Smith, to demonstrate its theoretical and applied limitations. While Brown sees great value in eye tracking research, he draws our attention to its obviousness in terms of telling us what we may already know. Nonetheless, Brown also outlines where the research may or should go, and supplies instructive illustrations to help us chart new courses and terrains.

In what is also a critical commentary on the articles contained in this special edition, Tim J. Smith responds to Brown’s article, pointing to what he sees as misconceptions. In Read, Watch, Listen: A Commentary on Eye Tracking and Moving Images, Smith reflects on his own ground-breaking work as he also summarises and problematizes the articles in this edition. Working from a position as a cognitive psychologist and from within a version of neo-formalist film criticism, Smith’s position on eye tracking is persuasive if corralled.

Combined, the articles in this special edition reflect on the past, present and future of eye tracking and moving image research, and include critiques of the very nature of research itself. The case study material that the articles draw from is predominantly from mainstream film and television texts, but these are explored through new vectors. Unlike Smith’s work, the authors extend their utilization of eye tracking data to consider the cultural, the ethnographic and anthropological, the ideological and the phenomenological, albeit within the house of film and television aesthetics and genre. We hope that other researchers will draw inspiration and insight from the studies undertaken in this edition.

Future Research Directions

As indicated, the Eye Tracking and the Moving Image Research group features practitioner-academics who are interested in how research can be both carried out and disseminated through creative practice. As two of the articles in this edition signal, there is interest and expertise in sound design and scoring, and in screenwriting. Both of these aspects, which speak to the broader gamut of film and television-making practices, have found natural positions within the research undertaken to date, and also feature in two forthcoming book chapters authored by the group’s members. What is of special interest to us in the near future is how we might use these practices to further develop research methods and research outputs. For example, rather than relying on pre-existing moving image texts, what if we were to make our own? How might we use specific practices – sound, screenwriting – in order to influence the eye tracking experiments that we conduct?

One idea is for the group to make one or more short films in order to test patterns of viewer engagement, where narrative and aesthetics are controlled by the researchers, thus becoming a creative practice research variable. Another idea is to analyse eye tracking data alongside aspects such as the score and the screenplay, in order to make original connections between the source text – intentionality – and its reception. We should also consider how the scientific data provided by the research – heat maps, gaze plots, etc. – might be used as the basis of a creative work in and of itself, such as an artwork or another moving image text. Andrea Rassell, Sean Redmond, Jodi Sita and Darrin Verhagen are currently engaged in public projection and installation projects that use the colours, spirals and vortexes of eye tracking data within thematised artworks.

The make-up of the group has also resulted in some interest in how viewers engage spatially and environmentally with moving image texts, posing questions such as: does the viewing environment alter gaze patterns? How might room set-up and screen size change where and for how long people look at a defined area of interest? In this way, the group might seek to add ethnographic methods to the studies that take place, allowing us to add another set of research variables that could produce interesting and original results. Depending on the context, this type of research would also be of use to the screen industry – distributors, cinema groups, screen manufacturers, interior designers, etc.

We are mindful, nonetheless, of where others might take eye tracking research. It is already being used in the commercial moving image industry, and one of the worries is that it will become a device to reduce production costs as filmmakers use the data to literally paint the screen by numbers. Film and television are artforms: they beautify the world and they enrich our lives. All members of the Eye Tracking and the Moving Image Research group want to employ eye tracking technology to get to know and understand this beauty, and to fully comprehend what the viewer sees, hears and feels when they watch Fred Astaire dance, Ryan Gosling seduce, or Sherlock deduce and detect.


Hasson, Uri, Landesman, Ohad, Knappmeyer, Barbara, Vallines, Ignacio, Rubin, Nava, and Heeger, David J. 2008. Neurocinematics: The neuroscience of film. Projections, 2(1), 1-26.

Sobchack, Vivian. 2000. What My Fingers Knew: The Cinesthetic Subject, or Vision in the Flesh. Senses of Cinema, Issue 5, available at: (accessed 19 January 2015).