Limitations of Direct Observation

Watching teachers at work may not be the best way to evaluate performance

December, 2016 Charles Maranzano

Accountability for the performance of public school teachers nationwide has fueled widespread educational reforms in the past decade. Responding to politically generated directives regarding standardized testing and teacher performance, many states revised teacher evaluation measures. A primary focus of these reforms is to quantify student progress systematically while incorporating new processes for teacher evaluation.

In order for schools to comply with federal mandates, an abundance of new evaluation instruments for teachers has been placed in use across the country. Emerging systems for teacher evaluation prioritize direct observation of teacher actions and use it as a major basis for teacher performance ratings.

School districts also are under increasing pressure to gather data about teacher performance as part of a related effort to address teacher tenure. In some cases, these data are tied to decisions about pay increments.

How valid and reliable are evaluation systems that rely primarily upon direct observation to reach quantifiable decisions about teacher performance? The use of direct observation as a major component of an overall data collection system for evaluative purposes may be less reliable than commonly thought. Here is why.

Open vs. closed systems

Open systems of recording teacher actions were widely utilized in the 20th century classroom and were intended to offer a rich qualitative description of classroom events that could serve as a basis for evaluations or assessments. These narratives were used to analyze teacher behaviors, frequently in a conference between an evaluator and a teacher or in a peer review. They were also frequently used to draw conclusions about summary ratings or judgements.

In contrast, the closed systems of today are collective and focus on specific types or aspects of teaching behaviors. These systems generally include categorical or sign systems, as well as behavior checklists and performance rating scales.

Checklists are typically set up so that the observer indicates the presence or absence of given behaviors during a lesson or during a specified time interval in a lesson. Category systems contain mutually exclusive categories that are applied to behavioral events that are generally placed in one particular domain or area.

Educational process-product research is generally devoid of persuasive educational theory regarding what constitutes exceptional teacher performance and has failed to validate why a particular set of teaching behaviors influences student outcomes. Research validates concerns from educators about why some variables are selected to be observed to the exclusion of others. This is particularly problematic for teachers of specialized subjects embedded in the curriculum, such as the fine and performing arts, foreign languages, physical education, philosophy, ethics, and a host of advanced level high school courses.

Limited use for deeper understanding

The problem with widely used observation systems is that they apply to teaching behaviors independent of the curricular context. Observations generally focus on isolated behaviors and do not take into account the preceding and subsequent behaviors that support the teaching decisions observed at a moment in time. There is little agreement concerning how many observations are required to measure authentic instructional practices or when during a teaching cycle observations should occur.

The exclusive use of direct observation as an evaluation procedure presumes that observable, overt teaching actions provide a sufficient basis for judging a teacher’s adequacy, even though teaching may not be just a set of observable performances or behaviors.

In general terms, observation may capture behavior and actions well, but it is very limited as a means of gaining a deeper understanding of teaching or outcomes. Classroom observation typically leaves out direct systematic evidence about teacher planning, teacher assessment, instructional context, and modification of instructional materials.

Excluded from view are teacher choice and adaptation of instructional methods, and a teacher’s working relationships with colleagues, parents, and members of the school community. Also absent from observable factors are the contributions that teachers of highly specialized subjects, such as music, drama, dance, or physical education, make outside the classroom setting, where they interact frequently with various publics outside of the traditional school schedule.

Observers change teacher behavior

Measurement conditions known to influence the reliability of direct observation systems include observer training and experience, instruments that require high or low inferences, clarity of instrument categories, and the specific curricular background of the observer. Subject matter, ability level of students, grade level, diversity of participants, lesson type, time of day, and objectives or goals of instruction account for systematic variation in teaching interactions and instructional arrangements.

In addition, the specific topic of instruction, characteristics of pupils observed, and other situational factors affect instructional practice and decisions. A teacher’s philosophy, district policies and procedures, facilities used for instruction, available resources and other institutional factors all influence variation in instruction.

Even under the best conditions, the informational value of direct observation is limited by methodological issues that affect the validity of conclusions. Of primary concern are systematic observation protocols that contribute to the obtrusiveness of direct observation. For example, the presence of an observer has been known to change the behavior of teachers or students, contributing to reactive effects. These reactive effects may occur because teachers and students are aware that their behaviors are being observed. Teacher or student anxiety can interfere with the drawing of valid inferences about what would normally occur in the classroom.

A single snapshot

Lack of reliability in observation settings is frequently a function of unclear definitions, the difficulty of observer judgements or inferences, insufficient training, observer fatigue, and the complexity of both the observation protocol and the behaviors observed. Efforts to develop instruments that are resistant to evaluators’ rating errors have failed to provide more reliable appraisals.

Research affirms many common problems in direct observation: contrast effects, first-impression errors, halo effects, similar-to-me effects, central-tendency errors, and observer expectations. In most cases, the informational value of direct observation depends on the type of instructional events observed and the ability of competent observers to evaluate accurately the quality and depth of subject matter taught.

The amount of time invested in direct observation, compared to the totality of teacher-student contact, is so low that it is no surprise that validity and reliability are questionable. In many instances, the actual time teachers are observed amounts to less than one hundredth of one percent of total teacher-pupil contact time in a given year or cycle. At best, a snapshot of teaching is the only clear result of direct observation.

What is needed to produce valid conclusions about teaching effectiveness is far more complex than this snapshot. Rather, a full motion picture, rich with multiple and varied points of data, may be required to begin to inform the record of what constitutes valid and reliable evaluation in today’s classrooms.

Educators deserve a much better analysis of teaching behaviors, one based upon valid and reliable statistical data. Points of data need to be rich and varied in order to support inferences about teacher actions and performance. Many more inclusive views of teaching and learning may need to become a part of the overall process for teacher evaluation. Administrators are called upon to play a greater part in the formative process of teacher development and as a result must redefine their contributions when it comes to summative evaluative judgements.

Until more reliable and valid systems for teacher evaluation evolve nationwide, the use of direct observation should be minimized and replaced with multiple points of data. Teachers and students deserve a process that respects both the quantifiable and qualitative aspects of teacher-student interactions and learning. Until then, outdated processes for reaching conclusions about teacher performance need to be reconsidered and re-conceptualized.

Charles Maranzano is the chief school administrator with New Jersey’s Lebanon Borough Public School.

