The Interpersonal Reactivity Index (IRI) is a self-report scale designed in 1980 by Mark H. Davis to assess empathy from a multidimensional perspective, conceptualized as a group of related but independent cognitive and affective processes. It is one of the first tools to assess cognitive and affective dimensions separately, and it has been one of the most widely used instruments in the field (Ilgunaite et al., 2017).
The scale consists of 28 items evenly distributed across four subscales: Perspective Taking (PT) and Fantasy (F), corresponding to the cognitive dimension, and Personal Distress (PD) and Empathic Concern (EC), to the affective dimension. The PD subscale contains items that assess negative feelings arising in the individual when perceiving another person's suffering, while EC evaluates feelings of compassion and a desire to alleviate the other person's suffering. PT assesses the ability to understand that others may have thoughts and feelings different from one's own, and F comprises items related to the tendency to put oneself in the position of fictional characters. The psychometric properties analysis reported by Davis (1980, 1983) shows satisfactory internal reliability of the four scales (ranging from .71 to .77) and test-retest reliability ranging from .62 to .71.
The question regarding the multidimensional nature of empathy has been central to both theoretical discussions and the development of measurement techniques.
Currently, different authors have considered that empathy involves at least two types of processes: cognitive and affective (e.g., Chakrabarti & Baron-Cohen, 2006; Decety & Jackson, 2004; Preston & De Waal, 2002). The former is responsible for making inferences about others' mental states (beliefs, ideas, desires, feelings), while the latter is linked to the ability to detect and/or experience others' emotional states. Despite the fact that multidimensional and integrative models of empathy are currently the most accepted, the discussion about the structure, components, and whether they are independent from each other is still unresolved (e.g., Lawrence et al., 2004; Spreng et al., 2009).
The four-factor model presented by Davis (1983) has been tested in various psychometric studies in normative populations, yielding mixed and inconsistent results. While some studies confirmed it, in others the fit of the original model was only marginally acceptable or required changes such as item elimination (Table 1).
Furthermore, some studies have yielded results that do not confirm the four-factor model, and among these, some authors identify dimensions different from those proposed by Davis (Baldner & McGinley, 2014, 2020; Koller & Lamm, 2015; Wang et al., 2020; Yarnold et al., 1996). Cliffordson (2002) and Hawk et al. (2013) propose that the original four factors contribute to a higher-order general factor, whereas Pulos et al. (2004) found a structure with a higher-order factor encompassing the PT, F, and EC subscales, with PD constituting an independent factor.
Another prominent model in the literature is the two-factor model: affective and cognitive, each composed of their corresponding subscales. Although this model is widely used in empathy research, no psychometric studies have confirmed its validity (Chrysikou & Thompson, 2016).
Table 1: Factorial structure: studies that have confirmed the original model (Davis, 1983)

Note: CFI: comparative fit index; TLI: Tucker-Lewis index; RMSEA: root mean square error of approximation; RMSR: root mean square residual.
Several psychometric studies have found issues with the reversed items of the IRI (Arenas-Estevez et al., 2021; Braun et al., 2015; García-Barrera et al., 2017; Grimaldo et al., 2022; Murphy et al., 2020; Palmese & Schmidt, 2013). The inclusion of reversed items in self-report scales has been debated as a potential threat to validity (Lundgren et al., 2018). Reversed items can lead to interpretation issues (Haladyna, 2002), resulting in the emergence of a method factor (Sonderen et al., 2013; Zhang et al., 2016). Current guidelines for the development and validation of items for tests or assessment instruments recommend wording items in a direct or positive manner, avoiding negative phrases (Haladyna & Rodriguez, 2013).
In Spanish, the issue with reverse-coded items appears to be particularly significant. Venta et al. (2022) evaluated the psychometric performance of reverse-coded items in a Spanish-speaking population in the United States. The results demonstrated a decline in the psychometric performance of these items. When the reverse-coded items were reverted to their original format, the item correlations with their respective subscale scores improved, along with the overall internal consistency of the scale. These findings suggest that reverse-coded items present more challenges in Spanish than in other languages, supporting the recommendation to avoid their use in scales designed for this language.
The IRI scale is not exempt from these limitations, as it includes nine reverse-worded items: four are reversed only (3, 12, 13, 19) and five are additionally written with a negation (4, 7, 14, 15, 18). At least five studies in Romance languages have reported issues with the reverse-coded items in the IRI: three in Spanish (Arenas-Estevez et al., 2021; García-Barrera et al., 2017; Grimaldo et al., 2022), one in Italian (Palmese & Schmidt, 2013), and one in French (Braun et al., 2015).
Among studies conducted in Spanish-speaking contexts, Grimaldo et al. (2022) examined structural validity, invariance, and reliability among university students in Peru. The results from confirmatory factor analysis (CFA) challenged the original four-dimensional model. However, a suitable model fit was attained by eliminating the reversed items and those with negligible variance. In a different study, the four items with reverse wording and negation (4, 14, 18, and 15) were excluded, resulting in a satisfactory fit for the four-dimensional model (Palmese & Schmidt, 2013). In another study involving university students from Colombia, CFA revealed poor indicators for nine items. The authors then attempted to enhance model fit by removing the reversed items, which led to favorable fit indicators (Arenas-Estevez et al., 2021).
Another study on the factorial structure of the IRI in an incarcerated population found that all reverse-worded items loaded onto a single component (method factor) despite belonging to different subscales. Upon removing these items, the four-component structure originally reported by Davis was replicated (Lauterbach & Hosser, 2007).
Beven et al. (2004) administered the IRI to 88 violent criminals from maximum-security prisons. They found that negative items were clustered in one component, along with one positively worded item that had lengthy and complex wording. The authors suggest that linguistic complexity and a potentially low level of reading comprehension might affect the responses, as observed in other cases.
In addition to the contributions from psychometric studies, the theoretical relevance of some scales like F and PD has been discussed, considering that they do not align with a current conceptualization of empathy (Baldner & McGinley, 2014). Criticism of the F scale suggests that it does not truly measure what it claims to but instead reflects aspects more related to imagination or self-control (Batchelder et al., 2017; Cliffordson, 2002). Regarding the PD scale, Murphy et al. (2020) found poor fit and argued that it is lacking in terms of construct validity. Furthermore, Cliffordson (2002) posits that this may not be a central component of empathy.
Despite these criticisms, the IRI continues to be a widely used instrument for measuring empathy, with numerous adaptations and translations. Validation and adaptation studies have been conducted in different languages, including Portuguese (Sampaio et al., 2011; Shiramizu & Yamamoto, 2018), Dutch (De Corte et al., 2007), Chinese (Siu & Shek, 2005), German (Paulus, 2009), Kannada (Rajput et al., 2020), French (Gilet et al., 2013), Farsi (Yaghoubi Jami & Wind, 2022), Swedish (Cliffordson, 2002), Korean (Kang et al., 2009), Russian (Budagovskaia et al., 2017), and Italian (Diotaiuti et al., 2021; Ingoglia et al., 2016). For Spanish-speaking populations, adaptations have been conducted in Spain (Lucas-Molina et al., 2017; Mestre-Escrivá et al., 2004; Pérez-Albéniz et al., 2003), Argentina (Muller et al., 2015), Colombia (Arenas-Estevez et al., 2021; Garcia-Barrera et al., 2017), and Chile (Fernández et al., 2011).
In the Spanish adaptation used in the present study (Pérez-Albéniz et al., 2003), adequate reliability values were obtained using Cronbach's alpha. In a sample of 515 students, the values for males were .73 for PT, .76 for F, .68 for EC (including item 13), and .70 for PD. For females, the values were .75 for PT and F, .70 for EC (including item 13), and .72 for PD.
In sum, the IRI is characterized by proposing a multidimensional assessment of empathy, contemplating affective and cognitive aspects across four subscales. Thus far, various psychometric studies have failed to yield consistent results regarding this model. Furthermore, several studies have demonstrated issues with the reverse-worded items. This study aims to analyse evidence of content, construct, and convergent validity of the Spanish-language version of the IRI (Davis, 1980; Pérez-Albéniz et al., 2003) in a sample of Uruguayan adults. Specifically, we aim to test the fit of the four-factor model in this sample, along with other relevant models proposed in the literature (Chrysikou & Thompson, 2016; Cliffordson, 2002; Hawk et al., 2013; Pulos et al., 2004).
Method
Participants
The sample comprised 858 Uruguayan participants, 640 females and 218 males, ranging from 18 to 90 years old (M = 34.27; SD = 15.43), with a medium to high socioeconomic level. Of the total participants, 59.6 % had completed secondary education and 40.4 % had completed tertiary education.
Instruments
Interpersonal Reactivity Index (IRI) (Davis, 1983; Pérez-Albéniz et al., 2003): a self-report scale for studying empathy that consists of four subscales, with seven items each in the original version, which independently assess cognitive and affective aspects of the construct. The Empathic Concern (EC, items 2, 4, 9, 13, 14, 18, 20, 22) and Personal Distress (PD, items 6, 10, 17, 19, 24, 27) subscales assess the affective dimension, whereas Perspective Taking (PT, items 3, 8, 11, 15, 21, 25, 28) and Fantasy (F, items 1, 5, 7, 12, 16, 23, 26) evaluate the cognitive aspects of empathy. Respondents use a five-point Likert scale (A: Does not describe me well to E: Describes me very well). The original study (Davis, 1983) reported adequate reliability values for the four subscales (F = .75, PT = .75, EC = .72, and PD = .78). In this study, the Spanish version adapted by Pérez-Albéniz et al. (2003) was used, which shows reliability values ranging from .60 to .78. In this adaptation, item 13 belongs to the EC subscale rather than to PD as in the original model, so EC comprises eight items and PD six.
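The subscale composition described above can be illustrated with a minimal scoring sketch. It uses the item assignments listed in this section and the nine reversed items listed in the introduction. The numeric coding of the response letters (A = 0 to E = 4) and the function name `score_iri` are assumptions made for illustration; the text specifies only the A to E anchors. The sketch scores the full 28-item scale and does not reproduce the reduced version retained later in the study.

```python
# Item assignments as listed in the Instruments section (Spanish adaptation).
SUBSCALES = {
    "PT": [3, 8, 11, 15, 21, 25, 28],
    "F": [1, 5, 7, 12, 16, 23, 26],
    "EC": [2, 4, 9, 13, 14, 18, 20, 22],
    "PD": [6, 10, 17, 19, 24, 27],
}

# Reversed items as listed in the introduction.
REVERSED_ITEMS = {3, 4, 7, 12, 13, 14, 15, 18, 19}

# Assumption: letters are coded 0-4; the text gives only the A-E anchors.
LETTER_TO_SCORE = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}
MAX_SCORE = 4


def score_iri(responses):
    """Return a dict of subscale totals.

    `responses` maps each item number (1-28) to a letter from A to E.
    Reversed items are re-keyed so that higher always means more of the trait.
    """
    totals = {}
    for name, items in SUBSCALES.items():
        total = 0
        for item in items:
            value = LETTER_TO_SCORE[responses[item]]
            if item in REVERSED_ITEMS:
                value = MAX_SCORE - value
            total += value
        totals[name] = total
    return totals


if __name__ == "__main__":
    all_a = {item: "A" for item in range(1, 29)}
    print(score_iri(all_a))
```

A respondent who answers A to every item scores above zero only on the reversed items, which makes the re-keying step easy to check by hand.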
Toronto Empathy Questionnaire (TEQ) (Spreng et al., 2009): a self-report questionnaire designed for assessing empathy that comprises 16 items, eight negatively worded and eight positively worded. Respondents use a five-point Likert scale (0: never to 4: always) to indicate the frequency with which they feel or act in a particular manner. The TEQ shows a good fit to a unidimensional model, with a single-factor structure comprising 16 items, each with a factor loading exceeding .40, and high reliability (Cronbach's α = .85; Spreng et al., 2009). In this study we used the version translated into Spanish and validated for the Uruguayan population (Carballo et al., 2023). The reduced scale (TEQ-R) demonstrated the following fit indices: CFI = .932, TLI = .905, χ²(20) = 213.58, p < .001, χ²/df = 10.65, RMSEA = .106, and SRMR = .045. The reliability index (McDonald's omega) was .82 for the full scale and .90 for the reduced one (TEQ-R). In this sample, reliability was ω = .804 for the full scale and ω = .782 for the TEQ-R.
Socioeconomic Level Index (INSE) (Perera, 2018): a questionnaire developed in Uruguay to assess the household’s socioeconomic level. The reduced version consisting of six items was used, allowing for sorting households based on their socioeconomic level, inferring consumption capacity from socio-demographic information and possession of both tangible and intangible assets.
Procedure
An instrumental study (Montero & León, 2002) was carried out using Classical Test Theory (Muñiz, 2010) aiming to adapt the IRI instrument to Uruguay.
The content validity study was based on the expert judgment procedure. All 28 items were submitted to three expert judges in the area of affective processes: a psychologist with a doctorate specialized in affective regulation, a psychologist with a doctorate specialized in emotional regulation, and a psychologist with a doctorate in biology and expertise in empathy research. First, the sufficiency criterion was evaluated for the whole scale. Then, each item was assessed in terms of language clarity, theoretical coherence, and relevance, using a four-point Likert scale (Escobar-Pérez & Cuervo-Martínez, 2008).
The sample was collected through non-probabilistic sampling using the snowball technique, by disseminating the study through social networks and inviting university students to participate. After accessing and accepting the informed consent, participants completed the IRI, TEQ, and INSE scales using the Qualtrics platform (Qualtrics, 2021).
The procedure, consents, and protocols have been approved by the Ethics Committee of the Catholic University of Uruguay, complying with the country's regulations on human research as governed by Executive Decree 001-4573/2007 and Law No. 18331 on Data Privacy, regarding the protection of personal data.
Data Analysis
To assess the content validity of the instrument, the Content Validity Coefficient (CVC) proposed by Hernández-Nieto (2002) was employed, applying the least stringent criterion for item retention. CVC values below .70 were considered unsatisfactory, while values above .80 were regarded as highly satisfactory. Subsequently, a statistical analysis of the items was conducted, evaluating their psychometric quality through the calculation of the mean, variance, skewness, and kurtosis.
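As an illustration of the content validity step, the following sketch computes a CVC for a single item using the formulation commonly attributed to Hernández-Nieto (2002): the mean judge rating divided by the maximum possible rating, minus a chance-agreement term (1/J)^J for J judges. The function names, the example ratings, and the "acceptable" label for values between .70 and .80 are assumptions for illustration; the study's actual ratings are not reproduced here.

```python
def content_validity_coefficient(ratings, max_score=4):
    """CVC for one item rated by several judges on a 1..max_score scale.

    Assumed formulation: CVC = mean(ratings) / max_score - (1 / J) ** J,
    where J is the number of judges.
    """
    n_judges = len(ratings)
    if n_judges == 0:
        raise ValueError("at least one rating is required")
    mean_rating = sum(ratings) / n_judges
    chance_error = (1 / n_judges) ** n_judges
    return mean_rating / max_score - chance_error


def classify_cvc(cvc):
    """Apply the thresholds stated in the text (.70 and .80)."""
    if cvc < 0.70:
        return "unsatisfactory"
    if cvc > 0.80:
        return "highly satisfactory"
    return "acceptable"  # label for the middle band is an assumption


if __name__ == "__main__":
    # Hypothetical ratings from three judges on a four-point scale.
    for ratings in ([4, 4, 3], [3, 3, 3], [3, 3, 2]):
        cvc = content_validity_coefficient(ratings)
        print(ratings, round(cvc, 3), classify_cvc(cvc))
```

With three judges the chance-agreement term is about .037, so an item needs a mean rating slightly above 3.3 out of 4 to exceed the .80 threshold.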
Regarding the confirmatory factor analysis (CFA), the weighted least squares mean and variance adjusted (WLSMV) estimator was used, as it is recommended for ordered categorical data due to its robustness and the fact that it does not assume a normal distribution of variables (Flora & Curran, 2004; Freiberg et al., 2013; Li, 2016). Several factorial structures were tested, including the original four-factor solution proposed by Davis (1980), a second-order factorial structure, a two-factor model distinguishing between the cognitive (CO) and affective (AF) dimensions, a three-factor model comprising PT, EC, and PD, a three-dimensional structure including PT, F, and EC, and finally, a two-dimensional model including PT and EC. Finally, the same factorial structures were tested after removing the negative items.
Model fit was assessed using absolute fit indices, such as the chi-square test, the chi-square/degrees of freedom ratio, and the root mean square error of approximation (RMSEA), as well as incremental fit indices, including the comparative fit index (CFI) and the Tucker-Lewis index (TLI). In general, a good fit is indicated when the chi-square value is nonsignificant (p ≥ .05) or when the chi-square/degrees of freedom ratio is below 2 or 3 (Schreiber et al., 2006). Furthermore, according to Hair et al. (2013), for samples exceeding 250 participants and scales containing between 12 and 29 items, CFI and TLI values equal to or greater than .92, as well as RMSEA values equal to or lower than .07, are recommended as cutoff points for evaluating model fit.
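The decision rules above can be summarised in a small helper. This is only a sketch of how the stated cutoffs might be applied to indices already produced by the modelling software; the function name `evaluate_fit`, the choice of 3 as the default ratio cutoff, and the example values are hypothetical and are not results from this study.

```python
def evaluate_fit(chi2, df, cfi, tli, rmsea, ratio_cutoff=3.0):
    """Check reported fit indices against the cutoffs described in the text.

    CFI and TLI >= .92 and RMSEA <= .07 follow the Hair et al. (2013)
    recommendation for N > 250 and 12 to 29 items; the chi-square/df ratio
    is compared with `ratio_cutoff` (2 or 3 per Schreiber et al., 2006).
    """
    if df <= 0:
        raise ValueError("degrees of freedom must be positive")
    ratio = chi2 / df
    checks = {
        "ratio_ok": ratio < ratio_cutoff,
        "cfi_ok": cfi >= 0.92,
        "tli_ok": tli >= 0.92,
        "rmsea_ok": rmsea <= 0.07,
    }
    checks["chi2_df_ratio"] = ratio
    checks["all_ok"] = all(
        checks[key] for key in ("ratio_ok", "cfi_ok", "tli_ok", "rmsea_ok")
    )
    return checks


if __name__ == "__main__":
    # Hypothetical indices, for illustration only.
    print(evaluate_fit(chi2=250.0, df=100, cfi=0.95, tli=0.93, rmsea=0.05))
    print(evaluate_fit(chi2=900.0, df=100, cfi=0.88, tli=0.86, rmsea=0.10))
```

Keeping each criterion as a separate flag makes it visible when a model passes the incremental indices but fails RMSEA, which a single pass/fail verdict would hide.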
The reliability of the instrument was estimated using McDonald's omega coefficient (McDonald, 1999), with values ranging between .70 and .90 considered indicative of adequate reliability (Campo-Arias & Oviedo, 2008; Dunn et al., 2014).
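For readers unfamiliar with the coefficient, the sketch below computes McDonald's omega for a single factor from standardized loadings, under the usual assumptions of a unidimensional factor with uncorrelated errors, so that each item's error variance is one minus its squared loading. The loadings and the function name are hypothetical; the study's own loadings are reported in its tables.

```python
def mcdonald_omega(loadings):
    """Omega for one factor from standardized loadings.

    omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of uniquenesses),
    with uniqueness_i = 1 - loading_i^2 (assumes uncorrelated errors).
    """
    if not loadings:
        raise ValueError("at least one loading is required")
    common = sum(loadings) ** 2
    unique = sum(1 - loading ** 2 for loading in loadings)
    return common / (common + unique)


if __name__ == "__main__":
    # Hypothetical standardized loadings for a four-item subscale.
    omega = mcdonald_omega([0.7, 0.6, 0.8, 0.5])
    print(round(omega, 3))
    print("within .70-.90 range:", 0.70 <= omega <= 0.90)
```

Unlike Cronbach's alpha, this formula does not require equal loadings across items, which is one reason omega is often preferred when loadings vary.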
Regarding convergent validity evidence, the normality of the scores was assessed using the Kolmogorov-Smirnov (K-S) test, considering a distribution to be normal when p > .05. The scores obtained in the IRI dimensions were correlated with the full version of the Toronto Empathy Questionnaire (TEQ; Spreng et al., 2009) and the abbreviated version adapted for Uruguay (Carballo et al., 2023). Spearman's rho coefficient was used for the correlation analysis, with interpretation criteria based on Akoglu (2018), where r ≥ .20 indicates a low correlation, r ≥ .50 a moderate correlation, and r ≥ .80 a strong correlation. Finally, effect size was calculated using G*Power software (Faul et al., 2009), considering values between 0.10 and 0.30 as small effects, between 0.30 and 0.50 as moderate effects, between 0.50 and 0.80 as large effects, and values greater than 0.80 as very large effects (Ferguson, 2009).
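The correlation step can be sketched with the standard library alone: Spearman's rho is the Pearson correlation of the ranks, with tied values receiving their average rank. The function names, the example scores, and the "negligible" label for coefficients below .20 are assumptions for illustration; the thresholds for low, moderate, and strong follow the criteria stated above.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


def interpret_rho(rho):
    """Label the absolute size of rho using the thresholds in the text."""
    size = abs(rho)
    if size >= 0.80:
        return "strong"
    if size >= 0.50:
        return "moderate"
    if size >= 0.20:
        return "low"
    return "negligible"  # label below .20 is an assumption


if __name__ == "__main__":
    # Hypothetical subscale and criterion scores for five respondents.
    subscale = [12, 15, 9, 20, 17]
    criterion = [30, 41, 28, 44, 39]
    rho = spearman_rho(subscale, criterion)
    print(round(rho, 3), interpret_rho(rho))
```

A rank-based coefficient is a natural choice here because the normality check described above applies to the score distributions, and Spearman's rho does not assume normally distributed scores.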
All analyses were conducted using Mplus version 8.4 (Muthén & Muthén, 1998-2011) and SPSS v.29.
Results
Content validity and item analysis
Table 2 reports the CVC and descriptive statistics, together with a normality test. The agreement among judges yielded very good content validity coefficients. All items exceeded the acceptable threshold (CVC > .70), and only four of the 28 items had CVC values below .80, while the remaining 86 % exhibited excellent CVC indices (> .80). The Kolmogorov-Smirnov test rejected the normality hypothesis, indicating that the IRI items do not follow a normal distribution.
Mardia's analysis (Mardia, 1970) was employed to assess multivariate skewness and kurtosis. The results indicated a skewness coefficient of 57.36 (df = 816, p = 1.00) and a kurtosis coefficient of 338.65 (p < .001), suggesting a lack of multivariate normality in the data. Regarding the IRI items, skewness coefficients ranged between -2 and 2, except for item 18, which exhibited severe skewness (skewness index > 3) and a kurtosis value exceeding eight (Kline, 2005).
Confirmatory Factor Analyses
Six different factorial structures were tested. Table 3 presents the fit indices for each model using the full IRI scale. None of the models exhibited a satisfactory fit. Given the previously reported issues with negatively worded items (Arenas-Estevez et al., 2021; Beven et al., 2004; Grimaldo et al., 2022; Lauterbach & Hosser, 2007; Palmese & Schmidt, 2013), the analyses were repeated after removing these items from the scale.
Table 4 presents the results of the confirmatory factor analysis (CFA) after removing the reversed items. Three models achieved acceptable fit indices: Model 1 (corresponding to the original four-factor model), Model 6 (excluding the PD subscale), and Model 7 (including the scales that assess the cognitive and affective dimensions of empathy).
Since the original model has the strongest theoretical and empirical support in literature, we decided to retain it for further analysis. Table 5 presents the final model, detailing the retained items within each subscale and their respective factor loadings.
Table 6 presents the descriptive statistics for the four subscales of the IRI for the overall group and by gender, based on the four-factor model without reversed items.
Table 3: Goodness-of-fit indices for models of the complete IRI scale

Note: CO: cognitive; AF: affective; χ²: chi-squared; df: degrees of freedom.
Table 4: Goodness-of-fit indices for models of the IRI scale without negative items

Note: CO: cognitive; AF: affective; χ²: chi-squared; df: degrees of freedom.
Finally, Table 7 presents the convergent validity analysis of the IRI without negatively worded items, examining its correlations with both the full version of the TEQ empathy scale (Spreng et al., 2009) and its short version (Carballo et al., 2023), as well as the intercorrelations among its subscales. Significant positive correlations were found between the PT, F, and EC subscales, while no correlation was observed between PT and PD. The four-factor model for the scale without reversed items demonstrated adequate consistency, as indicated by McDonald's omega coefficients. The reliability of the four subscales ranged from .75 to .84, which is considered acceptable.
Table 7: Convergent validity study with Toronto Empathy Questionnaire (TEQ), reliability and intercorrelation of the subscales

Note: PT: perspective taking; F: fantasy; EC: empathic concern; PD: personal distress; TEQ_R: reduced scale. McDonald's omega for each IRI subscale is presented on the main diagonal. **p < .01
Discussion
The study of empathy has gained significant relevance in psychology, impacting the way it is conceptualized and measured. A multidimensional approach is key for accurately assessing empathy, making the validation of widely used instruments like the Interpersonal Reactivity Index (IRI) crucial. The IRI, developed in 1980, evaluates both affective and cognitive empathy, though its original version treats these dimensions separately. Despite its widespread use, studies attempting to replicate the original factorial structure have yielded inconsistent results, with some showing poor fit indices. Psychometric research suggests that modifications to certain items are often necessary to achieve better model fit, and alternative factorial structures have also been proposed.
This study aimed to test the original four-factor model and some of the alternative models frequently seen in the literature. The results did not show a good fit for any of the models using the complete 28-item scale.
Several studies had previously noted issues with the inverted items on this scale, achieving better results upon their removal (Arenas-Estevez et al., 2021; Beven et al., 2004; Braun et al., 2015; García-Barrera et al., 2017; Grimaldo et al., 2022; Murphy et al., 2020; Palmese & Schmidt, 2013). The inclusion of inverted items in psychometric scales has been heavily criticized. It is common in scale construction to formulate positively and negatively worded items to avoid response acquiescence or biases, although this strategy introduces more problems than solutions (Muñiz & Fonseca-Pedrero, 2019; Suárez et al., 2018). The effects of inverted items on the factorial structure of the scale have already been studied (Tomás et al., 2012), mostly concluding that these scales end up being affected by method variance (Conway, 2002). Method variance is a form of systematic error that introduces extraneous variables related to the measurement method rather than the trait being measured (Campbell & Fiske, 1959), affecting the psychometric characteristics of the scale (Tomás et al., 2010). In particular, reliability deteriorates, and the unidimensionality of each evaluated dimension of the test is compromised by secondary sources of variance (Suárez et al., 2018; Woods, 2006), often resulting in the emergence of spurious factors or method factors that are not substantively meaningful with respect to their semantic content (Woods, 2006).
In Spanish, issues with reverse-coded items have been found to be more significant than in other languages, and it has been recommended to exclude them from questionnaires to avoid compromising the validity of the scale. Some explanations of why these items may be more problematic in Spanish point to the grammatical complexity of the language, which makes negative sentences more difficult to interpret. Additionally, reverse-coded items tend to increase cognitive load, which is further heightened in Spanish, potentially leading to misinterpretation. Furthermore, Romance languages generally avoid the use of negative constructions in everyday speech, making such items less intuitive for respondents (Venta et al., 2022). Another complicating factor is the translation of quantity and frequency adverbs such as "very," "usually," and "often." These could alter the perceived intensity of an action within a sentence. For instance, the English adverb "usually" may convey a different intensity than its Spanish equivalent, "normalmente." Such discrepancies can distort self-reported responses, an issue that is critical for validating translated constructs but has received little attention in research (Arenas-Estevez et al., 2021). The authors emphasize that these translation challenges can significantly affect how empathy constructs are understood in Spanish-speaking populations, further complicating the psychometric properties of tools like the IRI.
Considering all these antecedents and the poor fit obtained in the present study, the inverted items were removed and the five models with the reduced scale were retested. In this case, a good fit was found for two models: the original four-factor model, and a model that eliminates the PD subscale. Our results are consistent with previous studies on the IRI (Arenas-Estevez et al., 2021; García-Barrera et al., 2017; Grimaldo et al., 2022), highlighting the need to review reverse-coded items in Spanish versions of the scale.
The PD subscale has previously been questioned by other researchers (Koller & Lamm, 2015; Murphy et al., 2020). Its theoretical relevance as a central aspect of empathy has also been called into question, since it might measure an aspect more closely related to emotional dysregulation or neuroticism. Furthermore, a lack of correlation has been reported between this subscale and other measures of empathy (Carballo et al., 2023).
Taking into account the psychometric indicators, both models (the original and the three-dimensional model without PD) achieved good fit. As the instrument has four independent subscales, the presence of the PD scale does not impair or limit the others; therefore, its elimination at this point might not be necessary. It is recommended to continue investigating its theoretical relevance, considering the observations made by some of the aforementioned authors.
When comparing mean scores by gender, we found results consistent with previous studies (e.g., Braun et al., 2015; Davis, 1983; Pang et al., 2023). Pang et al. (2023) conducted a comprehensive investigation into sex/gender differences in empathic ability through three distinct studies, employing both large-sample self-report questionnaires and electroencephalography (EEG) measures. While self-reports showed higher empathy scores in women, particularly in PD and EC, neurophysiological data revealed no significant differences in neural responses to others' pain, suggesting a possible influence of social biases on subjective responses. Aligned with prior research, in our sample, women tended to obtain higher self-reported empathy scores.
This study has limitations. The sample is not fully representative, as it is biased toward a higher proportion of women, young individuals, and those from middle to high socioeconomic backgrounds. However, it meets the recommendation of using a sample size greater than 400 for categorical data (Mundfrom et al., 2005). On the other hand, at the psychometric level, some researchers note that there are no clear suggestions regarding the application of fit indices when analysing categorical variables, and conventional cutoff rules for categorical data have not yet been adopted (Xia & Yang, 2019). Garrido et al. (2016) studied the performance of the four commonly used fit indices (CFI, TLI, RMSEA, and SRMR) for estimating the number of factors to retain with categorical data, comparing them with the current gold standard of fit and the parallel analysis procedure by Horn (1965). They found that the CFI and TLI indices provide nearly identical estimates and are accurate fit indices, followed one step lower by the RMSEA. They do not recommend using SRMR as it provides deficient estimates.
Many authors consider it excessive to rely solely on statistics based on fit indices (Marsh et al., 2005). Although fit indices provide useful information for assessing model fit to data, they have several notable limitations. Simulation studies suggest that the implications of cutoff values change when sample size and factor loadings are manipulated (Stone, 2021). Stone (2021) suggests not relying exclusively on conventional fit indices that rigidly assess model fit to data. Three procedures should be considered: analysing conventional fit indices; analysing relative fit by testing different models and selecting the best-fitting one; and, lastly, using theory and logic to select a theoretically justifiable model among those that fit.
Considering the obtained results and the study limitations, it is recommended to review the cost-benefit of including inverted items. Future studies could work on directly wording the inverted items in the IRI. Moreover, it is encouraged to continue reviewing the theoretical and empirical relevance of the Personal Distress subscale and the adequacy of the original four-factor model.
Based on the results of this study, a shortened 16-item version of the IRI has been developed and is ready for use with the Uruguayan population. This marks the first short and valid psychometric instrument for assessing empathy from a multidimensional perspective in Uruguayan adults.

















