Impact of data source choice on multimorbidity measurement: a comparison study of 2.3 million individuals in the Welsh National Health Service

Study design and population

This cross-sectional study used routinely collected anonymised data held in the Secure Anonymised Information Linkage (SAIL) Databank and included individuals of all ages living in Wales and registered with a GP contributing data to SAIL on 1 January 2019. The study intentionally examines condition coding outside the COVID-19 pandemic to avoid capturing the effects of related restrictions and associated decreases in the diagnosis of physical and mental health conditions [15]. The study population was limited to people with at least 1 year of GP registration before 1 January 2019, to improve the stability of records and avoid under-ascertainment where an individual has recently moved practice and their PC record has not yet been populated with historic codes [16], and to those registered with GP practices that contribute data to the SAIL Databank (80% of GP practices and 83% of Welsh residents [17]). The population was stratified by age, sex, and deprivation status of neighbourhood of residence (using deciles of the Welsh Index of Multiple Deprivation [WIMD] 2019) [18]. Mortality was measured in the subsequent calendar year (to 31 December 2019).

Data sources

PC data obtained from the Welsh Longitudinal General Practice Dataset (WLGP) were used to define conditions using Read version 2 codes (SNOMED-CT codes were not operational in the SAIL Databank during the study period), prescribing and/or laboratory data [19]. HI data were derived from general and psychiatric HI episodes obtained from the Patient Episode Database for Wales (PEDW) using all recorded International Classification of Diseases 10th Revision codes present for each hospital discharge [20]. PEDW records hospital inpatient events in English hospitals where a patient is registered with a Welsh GP; however, neither PEDW nor WLGP provides data for patients from before they registered with a Welsh GP. Unlike Hospital Episode Statistics (HES), which records admissions, A&E attendances, and outpatient appointments from NHS England, PEDW records hospital inpatient episodes only [21]. Mortality data were derived from the Welsh Demographic Service Dataset.

Definition of long-term conditions

The choice of the 47 conditions was based on the results of a recent Delphi consensus study recommending those to include in the measurement of multimorbidity (Additional file 1) [14], and multimorbidity was defined as the presence of two or more conditions [1]. Phenotype definition and look-back duration for the codes defining each of the conditions followed rules defined by Barnett et al. [22] where possible. For the remaining conditions, inclusion criteria were agreed through discussion between authors CM, SWM, and BG. In certain cases, look-back durations varied within conditions to reflect the impact that living with the condition was likely to have on an individual. For example, anaemia was defined as a relevant code ever recorded for aplastic anaemia, sickle cell anaemia, and thalassaemia (conditions that are either life-long or life-threatening), but as a relevant code dated in the last 12 months for iron-, B12-, or folate-deficient anaemias (conditions that are more likely to be transient), with the results of both combined into a single variable defining the presence of ‘anaemia’ on 1 January 2019. Unless a look-back duration was specifically stipulated (for example, 1 year for asthma clinical codes), codes present between 1 January 2000 and the study cross-section date of 1 January 2019 were used for both PC and HI data. This approach was taken to avoid relative over-ascertainment from PC codes: historic codes from lifetime records have been transcribed into the electronic record in the PC data source, whereas the first electronic HI records held within PEDW began on 1 April 1995. Code lists used to define conditions were those created by Kuan et al. [23], available on the HDR UK Phenotype Library [17], and de novo code lists created specifically by the authors of this study where required (detailed in Additional file 2). We adapted prescribing code lists from the Cambridge Multimorbidity Score by Payne et al. to qualify conditions that resolve as ‘active’ on 1 January 2019 (e.g. asthma, epilepsy) [24].
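The look-back rule for anaemia described above can be illustrated with a short Python sketch. It is not the authors' code (data preparation was done in SQL and R). The subtype labels, the `has_anaemia` helper, and the treatment of the window boundaries as inclusive are assumptions made for illustration; the actual code lists are those in Additional file 2. The sketch also assumes that ‘ever recorded’ is bounded by the general 1 January 2000 start date.

```python
from datetime import date

INDEX_DATE = date(2019, 1, 1)    # study cross-section date (from the text)
WINDOW_START = date(2000, 1, 1)  # general start of the look-back window (from the text)

# Hypothetical subtype labels standing in for the real code lists.
EVER_RECORDED = {"aplastic", "sickle_cell", "thalassaemia"}
LAST_12_MONTHS = {"iron_deficiency", "b12_deficiency", "folate_deficiency"}


def one_year_before(d):
    """Same calendar day one year earlier (assumes d is not 29 February)."""
    return d.replace(year=d.year - 1)


def has_anaemia(events, index_date=INDEX_DATE):
    """Return True if one person's coded events meet the anaemia rule.

    events: iterable of (subtype, event_date) pairs for a single person.
    """
    recent_cutoff = one_year_before(index_date)
    for subtype, event_date in events:
        # Ignore codes outside the general window (boundaries assumed inclusive).
        if not (WINDOW_START <= event_date <= index_date):
            continue
        # Life-long or life-threatening subtypes count whenever recorded.
        if subtype in EVER_RECORDED:
            return True
        # Transient subtypes count only if coded in the last 12 months.
        if subtype in LAST_12_MONTHS and event_date >= recent_cutoff:
            return True
    return False
```

Under these assumptions, a sickle cell code from 2005 flags anaemia on 1 January 2019, whereas an iron-deficiency code from mid-2017 does not.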

Prescribing and laboratory data were available within the PC data source (WLGP). To ensure a fair comparison between ascertainment using codes present in the PC and HI datasets, based on availability within each data source, prescribing data were applied only to PC and linked PC-HI data. Conditions were categorised by the International Classification of Diseases and Related Health Problems 10th Revision (ICD-10) (Additional file 2).

Data analysis

We conducted a suite of analyses to estimate the prevalence and concordance of individual conditions and multimorbidity, and associations with mortality, between data sources. First, prevalence estimates for multimorbidity and each of the 47 conditions were calculated separately using only PC, only HI, and linked PC and HI (PC-HI) data. Second, the number of conditions each individual had was calculated using only PC, only HI, and linked PC-HI data. Associations with 12-month mortality were estimated using binary logistic regression, giving unadjusted and adjusted (by age, sex, and deprivation) odds ratios with 95% confidence intervals across morbidity count groups (0, 1, 2, 3, and 4+ conditions). Third, PC/HI prevalence ratios were calculated by dividing the estimated prevalence measured using only PC data by the estimated prevalence measured using only HI data. Fourth, the proportion ascertained by each data source alone compared with linked PC-HI data was calculated, with Wilson’s exact method used to calculate 95% confidence intervals [25]. Finally, we estimated concordance between only PC and only HI data by (1) calculating the percentage of patients identified as having each of the 47 conditions in both PC and HI data (hereinafter referred to as ‘percent agreement’) and (2) calculating Cohen’s kappa for each individual condition and for multimorbidity, using the following formula [26]:

$$\mathrm{Kappa} = \frac{p_{o}-p_{h}}{1-p_{h}}$$

where:

$p_o$: relative observed agreement between PC and HI data

$p_h$: hypothetical probability of chance agreement between PC and HI data.
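As an illustration of the kappa formula, the following Python sketch computes it from the four cells of a 2×2 cross-classification of one condition between PC and HI data. The function name and argument names are assumptions introduced here, and the chance-agreement term uses the standard Cohen definition based on the marginal proportions of each source.

```python
def cohens_kappa(both, pc_only, hi_only, neither):
    """Cohen's kappa for agreement between PC and HI on one condition.

    both:    people with the condition in both PC and HI data
    pc_only: people with the condition in PC data only
    hi_only: people with the condition in HI data only
    neither: people with the condition in neither source
    """
    n = both + pc_only + hi_only + neither
    if n == 0:
        raise ValueError("no observations")

    # Observed agreement: the sources agree on presence or on absence.
    p_o = (both + neither) / n

    # Chance agreement from each source's marginal prevalence.
    pc_pos = (both + pc_only) / n
    hi_pos = (both + hi_only) / n
    p_h = pc_pos * hi_pos + (1 - pc_pos) * (1 - hi_pos)

    if p_h == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_h) / (1 - p_h)
```

For example, with 20 people coded in both sources, 5 in PC only, 10 in HI only, and 15 in neither, observed agreement is 0.7, chance agreement is 0.5, and kappa is 0.4.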

The kappa statistic for each of the 47 conditions was stratified into categories to describe concordance between data sources (slight 0.01–0.20, fair 0.21–0.40, moderate 0.41–0.60, substantial 0.61–0.80, almost perfect 0.81–1.00). Given that HI ascertainment of asthma and epilepsy was not constrained by prescribing data, whereas PC and linked PC-HI ascertainment was, the final three measures of concordance could not be assessed for these conditions.
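The prevalence ratio and the proportion ascertained by a single source can be sketched in Python as follows. This is an illustration only, not the authors' R code. It assumes that the Wilson method referred to above is the Wilson score interval for a binomial proportion; the function names are introduced here for the example.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def wilson_interval(successes, n, z=Z_95):
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def prevalence_ratio(pc_cases, hi_cases):
    """PC/HI prevalence ratio; the shared population denominator cancels."""
    if hi_cases == 0:
        raise ValueError("ratio is undefined when no cases are found in HI data")
    return pc_cases / hi_cases
```

The proportion of linked PC-HI cases ascertained by PC alone would then be the number of PC cases divided by the number of linked cases, with `wilson_interval(pc_cases, linked_cases)` giving its interval.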

The project received ethical approval from the SAIL Databank independent information governance panel [27]. Data cleaning was performed using SQL to query IBM DB2 databases. Analysis and data visualisation were performed using R version 4.1.2, with regression models fitted using the glm function in the ‘stats’ package [28].
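For readers who want to check an unadjusted odds ratio by hand, the Python sketch below compares one morbidity count group with the reference group (0 conditions). It is not the authors' glm code. It relies on the fact that, for a single categorical predictor, the logistic regression estimate equals the odds ratio from the corresponding 2×2 table, and the Wald interval uses the standard error of the log odds ratio. The function and argument names are assumptions; adjusted odds ratios require a fitted model and are not shown.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def odds_ratio_vs_reference(deaths, survivors, ref_deaths, ref_survivors, z=Z_95):
    """Unadjusted odds ratio of death for a group versus the reference group.

    Returns (odds_ratio, lower, upper) with a Wald confidence interval.
    """
    cells = (deaths, survivors, ref_deaths, ref_survivors)
    if min(cells) <= 0:
        raise ValueError("all four cells must be positive")
    log_or = math.log((deaths * ref_survivors) / (survivors * ref_deaths))
    # Standard error of the log odds ratio from the four cell counts.
    se = math.sqrt(sum(1 / c for c in cells))
    return (
        math.exp(log_or),
        math.exp(log_or - z * se),
        math.exp(log_or + z * se),
    )
```

With hypothetical counts of 20 deaths and 80 survivors in one group against 10 deaths and 90 survivors in the reference group, the odds ratio is 2.25.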

Role of funding source

The funders of the study had no role in the study design, data collection, data analysis, data interpretation, or writing of the report. The corresponding author had full access to all the data used in the study and had final responsibility for the decision to submit the study for publication.
