Human Capital Theory and Research Productivity

A number of studies have successfully applied this human capital model to the research productivity of university faculty. Young faculty members often face intense pressure to publish but, as promotions are gained and years progress, research output often falls. McDowell (1982) finds hump-shaped relationships between age and research output for a sample of faculty members in a cross-section of disciplines. Diamond (1986) and Levin and Stephan (1991) find similar relationships for mathematicians and scientists, and Goodwin and Sauer (1995) find the same for a sample of academic economists. Webber (2012) finds that the number of years since the terminal degree has a negative effect on publication, and Tien and Blackburn (1996) find that publication rates tend to rise as professors approach promotion and fall after promotion. Other authors report similar declines in research productivity in later life (Galenson and Weinberg 2000; Jones 2010; Jones and Weinberg 2011; Oster and Hamermesh 1998; Stroebe 2010).

Stroebe (2010) notes several additional factors that might account for the apparent drop in output at later ages. For example, universities might reallocate resources away from older faculty members and toward the younger researchers they hope to attract and encourage. Also, if mandatory retirement looms in the near future, older faculty members might be less motivated to keep current and to continue producing effectively. Finally, because older faculty members are likely to have the security of academic tenure, they might be more inclined to shirk and relax.

Studies also find differences across disciplines and across genders. For example, the rate at which prior knowledge becomes obsolete probably varies by field. According to McDowell (1982), the rate of “literature decay” in hard sciences such as physics and chemistry is significantly higher than that in humanities such as history and English. He also argues that because women are more likely than men to interrupt their careers for reasons such as child care, they tend to gravitate to fields in which knowledge is more durable. If so, women might suffer less than men from declining productivity as they age.
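The hump-shaped profile these studies describe is commonly captured by entering age as a quadratic in a productivity equation. The specification below is a generic illustration of that functional form, with symbols of our own choosing, not a model taken from any particular study cited above:

$$
y_{it} = \beta_0 + \beta_1\,\mathrm{Age}_{it} + \beta_2\,\mathrm{Age}_{it}^{2} + X_{it}'\gamma + \varepsilon_{it},
\qquad \beta_1 > 0,\ \beta_2 < 0,
$$

where $y_{it}$ is a measure of research output for faculty member $i$ in year $t$ and $X_{it}$ collects other controls. With those signs, output rises early in the career, peaks at age $-\beta_1/(2\beta_2)$, and declines thereafter.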

Age and Teaching Effectiveness

Although fewer researchers have studied the effect of age on productivity inside the classroom, human capital theory would predict a similar hump-shaped, non-linear relationship. As younger teachers gain experience their classroom performance should improve, but other factors should eventually push in the opposite direction. For example, the teaching prowess of older faculty members might suffer from an inability to stay current in their fields, or they might be allocating relatively more time to administrative rather than classroom pursuits (McPherson et al. 2009). Also, students might find it easier to connect with faculty members closer to their own age. Complaints by older faculty members about not understanding the younger generation certainly ripple through the halls of academia with some frequency.

However, measuring the relationship between age and teaching productivity is challenging. The quantity and quality of research outputs can be measured with some objectivity, but defining, much less measuring, the effectiveness of classroom instruction surely ranks among the more controversial issues in education. Most researchers employ various measures of value added, typically changes in student test scores (a framework sketched formally below), but others quarrel with this approach. For example, Corcoran (2010) argues that value-added measures can be biased by a variety of random factors such as family events, student health, the presence of disruptive classmates, and even the effect of what students learn in other classes. Baker et al. (2010) find that teacher rankings based on value-added measures can fluctuate wildly from year to year. Since true teacher quality is unlikely to vary significantly from year to year, these measures probably are biased and unreliable indicators of classroom effectiveness. Rothstein (2015) adds that value-added measures based on different tests in the same discipline are only weakly correlated and that the correlation between changes in test scores and other types of performance measures is weaker still. He also contends that teachers can inflate value-added measures by teaching to the test and that these measures ignore important non-cognitive skills. On the other hand, recent studies by Chetty et al. (2014a, b) find that such biases are small and that value-added measures correlate well with long-run student success.

We do have extensive datasets that allow us to trace and analyze value-added measures through time for elementary and secondary school children and, when included in research studies, teacher experience almost always turns out to be a significant determinant of student achievement (Harris 2009). As predicted by human capital theory, productivity gains seem especially strong in the first few years of a teacher’s career (Clotfelter et al. 2007; Jackson and Bruegmann 2009; Rockoff 2004), but recent studies by Ost (2014), Papay and Kraft (2012), and Wiswall (2013) conclude that teachers continue to become more productive for many more years. Thus, to the extent that age correlates with experience, K-12 teaching effectiveness does seem to increase with instructor age for at least some period of time.

The link between age and teaching effectiveness in higher education has proven more problematic. Judging the effectiveness of classroom instruction is quite difficult, and the kinds of longitudinal data on test scores used in K-12 studies rarely exist for university-level students.
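The value-added framework underlying these K-12 results can be stated explicitly. The equation below is a stylized version of the typical specification, not the exact model of any study cited here:

$$
A_{it} = \lambda A_{i,t-1} + \theta_{j(i,t)} + X_{it}'\beta + \varepsilon_{it},
$$

where $A_{it}$ is student $i$’s test score in year $t$, $X_{it}$ contains student and classroom controls, and $j(i,t)$ indexes the student’s teacher. The estimated teacher effect $\hat{\theta}_j$ is the teacher’s “value added.” The objections of Corcoran (2010) and Baker et al. (2010) amount to saying that $\hat{\theta}_j$ also absorbs luck and classroom composition, so it can fluctuate even when true quality does not. And it is precisely the absence of repeated scores $A_{it}$ for university students that pushes higher-education studies toward other measures, such as the student evaluations discussed next.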
Though quite controversial and subject to potential bias from a variety of factors, most schools rely on student evaluations of teaching (SET) as a primary measure of teaching quality (Denson et al. 2010). However, despite an enormous literature on the subject, the evidence supporting their use is mixed. For example, in his meta-analysis, Clayson (2009) finds a positive, albeit weak, correlation between SETs and various objective measures of value added across different sections of a course but finds no significant relationship for students within a section. In other words, students in sections taught by highly rated instructors do seem to learn more, but those students showing the largest gains in knowledge do not rate their instructors any better than do other students in the same class with smaller knowledge gains. Carrell and West (2010) find that SETs for a sample of instructors at the U.S. Air Force Academy are positively related to contemporaneous student learning but negatively related to student achievement in subsequent courses. Galbraith et al. (2012) identify a variety of methodological problems that might bias prior analyses and, using a large sample of business courses, find no significant relationships between SETs and achievement of student learning outcomes. They conclude that there is little reason to believe that SETs serve as a valid indicator of teaching effectiveness.

But many other researchers do find positive and significant correlations between SETs and indicators of student learning (Beleche et al. 2012; Centra 1993; Davis 2009; Marsh 1984; Marsh and Roche 1997). Studying a sample of medical students, Stehle et al. (2012) find that SETs have a strong positive correlation with results on a practical examination, even though they were unrelated to scores on a multiple-choice test. Davis (2009) concludes that “students of highly rated teachers achieve higher final exam scores, can better apply course material, and are more inclined to pursue the subject subsequently” (p. 534), and Benton and Cashin (2014) write that “In general, student ratings tend to be statistically reliable, valid, and relatively free from bias or the need for control, perhaps more so than any other data used for faculty evaluation” (p. 12).

Other studies have identified additional factors that impact SETs, and several (Kinney and Smith 1992; McPherson et al. 2009; Wiswall 2013) have modeled SETs as a function of various instructor and course characteristics. Failure to control for these characteristics might create bias and probably accounts for some of the disparate results found in the literature. For example, in addition to age, SETs might be affected by an instructor’s gender and physical appearance. While there is some agreement that both male and female students tend to give higher ratings to instructors of their own gender (Centra and Gaubatz 2000), the overall effects are not clear. Some early research with high school teachers showed a slight tendency for males to receive higher ratings (Bernard et al. 1981), but more recent studies find slightly higher ratings for females (Feldman 1993; Whitworth et al. 2002) or no significant difference at all (Feldman 1992). However, McPherson et al. (2009) conclude that the male instructors in their sample of economics courses received higher ratings than females. For better or worse, physical beauty might also matter.
According to Hamermesh and Biddle (1994), workers rated as striking or above average in attractiveness earn a wage premium of about 15% over those rated as below average or homely. Beauty apparently impacts SETs as well. O’Reilly (1987) finds that the physical attractiveness of dental school instructors affected student opinions of their teaching effectiveness. Professors at the University of Texas judged by students as better looking also earned stronger evaluations (Hamermesh and Parker 2005). Perhaps surprisingly, beauty had more of an impact on the ratings of male professors than on those of females. Younger students also seem to prefer attractive instructors, as even elementary school children tend to rate good-looking teachers more highly (Goebel and Cashen 1979). Feeley (2002) argues that this phenomenon results from a “halo effect” in which the beauty of the instructor creates a halo whose aura spreads to impact student perceptions of other, unrelated characteristics.

Course characteristics also impact SETs. Researchers have found that variables such as class size (Green et al. 2012; Hamilton 1980; McPherson et al. 2009), course level (Braskamp and Ory 1994; Feldman 1978), and whether or not the course is an elective (Feldman 1978; McPherson et al. 2009) affect SETs. In addition, numerous authors have studied the relationship between SETs and course difficulty and expected grades. Bowling (2008) finds that SETs are contaminated by differences in course difficulty and that students give higher ratings to instructors in courses they consider easy. Moreover, the positive effect of easiness ratings on course evaluations is stronger in public schools with low academic rankings than in more highly ranked private institutions. Though not all studies agree (Centra 2003), many analyses also conclude that higher expected grades correlate with higher SETs (Blackhart et al. 2006; Braskamp and Ory 1994; Krautmann and Sander 1999; Langbein 2008; McPherson et al. 2009). To the extent that students expecting high grades are those who have learned more and rate their instructors highly as a result, SETs can be a valid indicator of teaching effectiveness. However, the more common interpretation is that instructors are able to buy better evaluations by awarding higher-than-deserved grades. In this interpretation, SETs are a biased indicator of teaching quality.

Thousands of articles have been published on the validity of using SETs as a measure of teaching effectiveness, and a comprehensive review of them is beyond the scope of this paper. Many different views can be supported by at least one study. Nonetheless, many studies do conclude that SETs can be valid indicators of teaching effectiveness, at least when appropriate instructor and course characteristics are controlled for. Moreover, the ubiquitous use of SETs in promotion and tenure decisions is evidence that administrators believe them to be a primary indicator of classroom performance. Indeed, one recent survey found that department chairs weighed SETs more heavily than any other factor in their overall evaluations of a faculty member’s teaching effectiveness (Becker et al. 2012).
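The control strategy implicit in this literature can be summarized in one stylized equation. The grouping of regressors here is ours and is meant only to illustrate the kind of specification estimated by, for example, Kinney and Smith (1992) and McPherson et al. (2009):

$$
\mathrm{SET}_{ic} = \alpha + X_i'\gamma + Z_c'\delta + \varepsilon_{ic},
$$

where $\mathrm{SET}_{ic}$ is the rating instructor $i$ receives in course $c$, $X_i$ contains instructor characteristics (age, experience, gender, physical appearance), and $Z_c$ contains course characteristics (class size, course level, elective status, expected grades). Omitting elements of $X_i$ or $Z_c$ that happen to correlate with age will bias the estimated age effect, a concern that is central to the studies reviewed next.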

Age and SETs

Some studies using SETs as an indicator of teaching quality do find them to be negatively related to instructor age (Meshkani and Hossein 2003; Wachtel 1998), but the literature shows no consistent relationships (Blackburn and Lawrence 1986). Ragan and Walia (2010) find that new instructors get lower ratings but that this disadvantage disappears within a relatively few years. Hamermesh and Parker (2005) find no effect of instructor age in their sample of faculty members, nor do Mardikyan and Badur (2011). Spooren (2010) finds that age has a negative but non-significant effect on SET scores in his sample. The effect of age could be confounded with that of experience but, after adjusting for the positive impact of experience, McPherson et al. (2009) still find that age has a negative effect. They conclude that experience raises effectiveness, but that older instructors with 20 years of experience do not fare as well as younger ones with the same 20 years of experience.

In perhaps the most extensive study of age and teaching performance in higher education, Kinney and Smith (1992) find non-linear effects of age on SETs that vary across different academic areas. In their sample, SETs for faculty in the humanities tend to fall up to about age 50 and then increase slightly. They speculate that these departments might place more relative value on teaching effectiveness, which in turn encourages faculty to continue honing their skills through their later years. They also suggest that older professors in these fields might become more concerned with the intellectual growth of their students relative to their own. On the other hand, they find that evaluations in the sciences rise until faculty reach their mid-40s and then decline continually after that (the arithmetic of such turning points is illustrated below). Their results are statistically significant, but the quantitative effects are small.

These studies are suggestive but, because it is extremely difficult to obtain or compare faculty evaluation data from multiple schools, they rely on samples from a single university or, in some cases, a single department within that university. More importantly for our purposes, none used samples containing large numbers of faculty past the age of 64.
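Non-linear patterns like those in Kinney and Smith (1992) are consistent with the quadratic profile sketched earlier, whose turning point falls at $-\beta_1/(2\beta_2)$. The coefficients below are purely illustrative, chosen only to show the arithmetic:

$$
\mathrm{SET} = \alpha + 0.090\,\mathrm{Age} - 0.001\,\mathrm{Age}^{2}
\quad\Longrightarrow\quad
\mathrm{Age}^{*} = \frac{0.090}{2(0.001)} = 45,
$$

so ratings rise until the mid-40s and decline thereafter, the pattern they report for the sciences. A profile that instead falls until about age 50 and then recovers, as they report for the humanities, requires the opposite signs ($\beta_1 < 0$, $\beta_2 > 0$).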

Rate My Professors

Popular websites such as RateMyProfessors.com (RMP) publish student ratings for a broad and diverse sample of college and university instructors. The RMP site allows students to rate professors on three criteria: helpfulness, clarity, and easiness. The site also publishes an overall quality rating that is the simple average of the ratings for helpfulness and clarity. The site imposes almost no restrictions on who participates and, as Davison and Price (2009) report, fraudulent ratings are an issue. Students can log onto RMP under fake names and rate faculty members multiple times, and faculty members can enter the site and rate themselves or their colleagues as well. Moreover, since students with strong opinions about an instructor probably are more likely to take the initiative to log on and provide ratings, the sample of students rating a particular instructor could be unrepresentative of the population.

Nonetheless, a growing body of evidence suggests that RMP ratings closely mirror those of university-run evaluations. Kindred and Mohammed (2005) conclude that student postings on the RMP website accurately reflect the opinions of students interviewed in focus groups about the quality of teaching delivered by their professors. Looking at a sample of 426 instructors at the University of Maine, Coladarci and Kornfield (2007) find a strong positive correlation between RMP ratings and those for corresponding questions on the university-administered evaluations and, based on their work with Brooklyn College faculty evaluations, Brown et al. (2009) conclude that RMP ratings are strong predictors of instructor SET ratings. Timmerman (2008) reports that overall quality ratings on RMP correlate highly with the summary ratings for evaluations in a sample of five different universities and, according to Otto et al. (2008), RMP ratings are consistent with “what would be expected if the ratings were valid measures of student learning” (p. 364). Despite their potential defects, RMP ratings apparently do closely track the SETs that, for better or worse, are commonly used to measure teaching effectiveness.
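For reference, the overall quality rating described above is computed as the simple average

$$
\mathrm{Overall} = \frac{\mathrm{Helpfulness} + \mathrm{Clarity}}{2},
$$

with the easiness rating reported separately and excluded from the average.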