Journal of Diabetes and Clinical Research
ISSN: 2689-2839

Research Article - Journal of Diabetes and Clinical Research (2022) Volume 4, Issue 1

A Machine Learning Study of 534,023 Medicare Beneficiaries with COVID-19: Implications for Personalized Risk Prediction for People Over 65

Chen Dun, MHS1*, Christi M. Walsh, MSN, CRNP1, Sunjae Bae, MD, PhD1,3,4, Amesh Adalja, MD, FIDSA5, Eric Toner, MD5, Timothy A. Lash, MBA6, Farah Hashim, BA1, Joseph Paturzo, BA1, Dorry L. Segev, MD, PhD1,3, Martin A. Makary, MD, MPH1,2

1 Department of Surgery, Johns Hopkins University School of Medicine, Baltimore, MD, USA

2 Department of Health Policy & Management, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA

3 Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA

4 Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA

5 Center for Health Security, Johns Hopkins University, Baltimore, MD, USA

6 West Health Institute, San Diego, CA, USA

*Corresponding Author:
Chen Dun

Received date: January 14, 2022; Accepted date: February 11, 2022

Citation: Dun C, Walsh CM, Bae S, Adalja A, Toner E, Lash TA, et al. A Machine Learning Study of 534,023 Medicare Beneficiaries with COVID-19: Implications for Personalized Risk Prediction for People Over 65. J Diabetes Clin Res. 2022;4(1):10-16.

Copyright: © 2022 Dun C, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Background: The global outbreak of COVID-19 has resulted in over 378 million infections and worldwide casualties have surpassed 5.6 million. Identifying individuals at highest risk of COVID-19 death may help risk evaluation and health service provision. Personalized risk prediction that uses a broad range of comorbidities requires a cohort size larger than that reported in prior studies.

Methods: Medicare claims data were used to identify patients aged 65 years or older with a diagnosis of COVID-19 between April 1, 2020 and August 31, 2020. Demographic characteristics, chronic medical conditions, and other patient risk factors that existed before the advent of COVID-19 were identified. A random forest model was used to empirically explore factors associated with COVID-19 death. The independent impact of the factors identified was quantified using multivariate logistic regression with random effects.

Results: We identified 534,023 COVID-19 patients, of whom 38,066 had an inpatient death. Demographic characteristics associated with COVID-19 death included advanced age (85 years or older: aOR, 2.07; 95% CI, 1.99-2.16), male sex (aOR, 1.88; 95% CI, 1.82-1.94), and non-white race (Hispanic: aOR, 1.74; 95% CI, 1.66-1.83). Leading comorbidities associated with COVID-19 mortality included sickle cell disease (aOR, 1.73; 95% CI, 1.21-2.47), chronic kidney disease (aOR, 1.32; 95% CI, 1.29-1.36), leukemias and lymphomas (aOR, 1.22; 95% CI, 1.14-1.30), heart failure (aOR, 1.19; 95% CI, 1.16-1.22), and diabetes (aOR, 1.18; 95% CI, 1.15-1.22).

Conclusions: We created a personalized risk prediction calculator to identify candidates for early vaccine and therapeutics allocation (www. These findings may be used to protect those at greatest risk of death from COVID-19.


Keywords: COVID-19, Machine learning, Risk prediction


The global outbreak of COVID-19 has resulted in over 378 million infections and worldwide deaths have surpassed 5.6 million. In the United States, there have been over 75 million confirmed cases and over 886,000 deaths as of February 1, 2022. COVID-19 heavily burdened the American healthcare system, greatly damaged the U.S. economy, and disrupted nearly every facet of life in the United States [1]. Early results of new vaccines and therapeutics aimed at fighting the COVID-19 pandemic are promising; however, the supply will be constrained given worldwide demand [2,3]. Among therapeutics, convalescent plasma and remdesivir have already been supply constrained, placing clinicians in the difficult position of rationing the supply based on risk factors that have not yet been fully elucidated [4]. This dilemma will be further magnified with the market introduction of monoclonal and polyclonal antibodies, which have not been pre-produced as a part of Operation Warp Speed [5].

Advanced age is a well-known risk factor for COVID-19 death; however, risk evaluation and health service provision would be better informed by more complete data [2,3]. Using a machine learning approach, we explored a wide range of factors associated with COVID-19 death in a cohort of 534,023 COVID-19 patients over 65 years of age. A prediction calculator was created to help identify the personal risk of COVID-19 mortality given an individual’s age, demographic information, and comorbidity profile.


Data source and study population

Through a special data arrangement with the U.S. Centers for Medicare and Medicaid Services (CMS), we accessed the Medicare data server, which contains 100% of Medicare fee-for-service claims in the U.S. We identified people aged 65 years or older who were diagnosed with COVID-19 from carrier, inpatient, and outpatient claims between April 1, 2020 and August 31, 2020. A diagnosis of COVID-19 was identified using the International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) code U07.1 (2019-nCoV acute respiratory disease) [6]. The Medicare Master Beneficiary Summary File (MBSF), which includes beneficiary enrollment information, was used to identify a patient’s demographic characteristics and geographic location.

Outcome variables

The primary outcome for this analysis was mortality among hospitalized patients who were diagnosed with COVID-19. Patient death was identified if either of the following two criteria was met in a patient’s inpatient claims: 1) a discharge status was listed as “expired” or 2) a patient status indicator was listed as “died”.
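The two-criterion outcome definition can be read as a simple rule applied to each patient's inpatient claims. The following is a minimal sketch only: the field names `discharge_status` and `patient_status`, and the record layout, are hypothetical placeholders chosen for illustration and are not actual Medicare claim field names or codes.

```python
def died_in_hospital(inpatient_claims):
    """Return True if any inpatient claim meets either death criterion."""
    for claim in inpatient_claims:
        # Criterion 1: discharge status listed as "expired" (hypothetical field name).
        if claim.get("discharge_status") == "expired":
            return True
        # Criterion 2: patient status indicator listed as "died" (hypothetical field name).
        if claim.get("patient_status") == "died":
            return True
    return False


# Hypothetical claim records, for illustration only.
claims_a = [
    {"discharge_status": "home", "patient_status": "alive"},
    {"discharge_status": "expired", "patient_status": "alive"},
]
claims_b = [{"discharge_status": "home", "patient_status": "alive"}]

print(died_in_hospital(claims_a), died_in_hospital(claims_b))
```

Because either criterion is sufficient, a patient with several admissions is counted as an inpatient death if any one claim satisfies the rule.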

Independent variables

Patients’ sex, age, race, and geographical location were identified from the MBSF. Race was classified as White, Black, Hispanic, Asian (which included Pacific Islander), and Other/Unknown (which included American Indian/Alaska Native, other, and unknown). We categorized age groups as 65-69, 70-74, 75-79, 80-84, and 85 years or older. Residential zip codes were used to account for geographic differences.
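The age categories described above map directly onto a small binning function. This is an illustrative sketch; the function name `age_group` and the string labels are assumptions made for the example, not part of the study's code.

```python
def age_group(age):
    """Map an age in whole years to the study's five age categories."""
    if age < 65:
        raise ValueError("the study cohort is restricted to ages 65 and older")
    if age >= 85:
        return "85 or older"
    # Five-year bins starting at 65: 65-69, 70-74, 75-79, 80-84.
    lower = 65 + 5 * ((age - 65) // 5)
    return f"{lower}-{lower + 4}"


print([age_group(a) for a in (66, 74, 80, 85, 93)])
```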

Comorbidities of patients were captured using the CMS Chronic Conditions Data Warehouse (CCW). The CMS CCW is a database that lists each of the 67 available comorbidity diagnoses that have been assigned to a Medicare beneficiary between January 1, 1999 and December 31, 2018 [7].

Statistical analysis

We fit a random forest model, a machine learning method that predicts an outcome by aggregating a large number of decorrelated decision trees, each built with an element of randomness, into a “forest” of trees. During this procedure, the contribution of each predictor to the predictive accuracy of the resulting model was evaluated; this quantity is referred to as the variable importance. We used the variable importance as an auxiliary means of identifying the clinical factors that contributed most to the predictive accuracy of the model. By using a random forest model, we were therefore able to capture conditions and comorbidities that were not emphasized in prior studies. Our adjusted multivariate model included comorbidities identified from the random forest model, comorbidities with a prevalence greater than 30%, and comorbidities that had been reported in previous studies. Geographic clustering was accounted for with a multilevel logistic random-intercept model of mortality with patients nested within counties. This mixed-effects multivariable logistic regression with a random intercept for county was used to estimate the odds ratios of death. Odds ratios and 95% confidence intervals were reported for each risk factor. A personalized risk prediction nomogram was constructed from the coefficients of the multivariable logistic model to calculate risk using the following formula: probability = exp(Σ β × X) / [1 + exp(Σ β × X)]. Statistical significance was assessed at the α = 0.05 level. Statistical analysis for this study was performed in SAS enterprise version 7.1 (SAS, Inc., Cary, NC).
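The probability formula used for the nomogram is the standard logistic transformation of the linear predictor. The sketch below shows how it could be evaluated for one patient profile. All numeric values here are hypothetical: the fitted intercept and coefficients are not given in the text, the intercept is written separately rather than folded into Σ β × X, and the county-level random intercept is omitted.

```python
import math


def predicted_probability(intercept, coefficients, features):
    """Logistic formula: exp(lp) / (1 + exp(lp)), where lp = intercept + sum(beta * x)."""
    linear_predictor = intercept + sum(
        beta * features.get(name, 0) for name, beta in coefficients.items()
    )
    return math.exp(linear_predictor) / (1 + math.exp(linear_predictor))


# Hypothetical intercept and coefficients, for illustration only.
hypothetical_intercept = -3.0
hypothetical_coefficients = {"male_sex": 0.6, "chronic_kidney_disease": 0.3}

# Reference profile (all indicators 0) versus a profile with both indicators set.
p_reference = predicted_probability(hypothetical_intercept, hypothetical_coefficients, {})
p_profile = predicted_probability(
    hypothetical_intercept,
    hypothetical_coefficients,
    {"male_sex": 1, "chronic_kidney_disease": 1},
)
print(p_reference, p_profile)
```

Each coefficient shifts the log-odds additively, so the predicted probability always stays between 0 and 1 regardless of how many risk factors are present.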


Demographics of patients diagnosed with COVID-19

From April 1, 2020 through August 31, 2020, 534,023 Medicare beneficiaries over the age of 65 had a diagnosis of 2019 novel coronavirus (COVID-19) in at least one Medicare claim. Of these, 148,151 (27.7%) patients were hospitalized and 38,066 (7.1%) died in a hospital.
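The hospitalization and inpatient death percentages follow directly from the reported counts. A short arithmetic check, using only the three counts stated above (the helper name `percent` is an illustrative choice):

```python
cohort = 534_023            # beneficiaries with a COVID-19 diagnosis
hospitalized = 148_151      # patients who were hospitalized
inpatient_deaths = 38_066   # patients who died in a hospital


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(percent(hospitalized, cohort))      # share of the cohort hospitalized
print(percent(inpatient_deaths, cohort))  # share of the cohort with an inpatient death
```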

The highest prevalence of COVID-19 infections was among white patients (n=396,198, 74.2%), female patients (n=307,595, 57.6%), and those age 85 or older (n=138,195, 25.9%). The median age of a COVID-19 diagnosed patient was 77 years (IQR 70-85), and the median age of patients who died was 80 years (IQR 73-87).

Chronic disease risk factors

Each of six comorbidities was present in the majority of COVID-19 patients: 80% (n=423,808) had hypertension, 76% (n=402,979) had hyperlipidemia, 63% (n=335,413) had anemia, 62% (n=332,422) had a cataract, 61% (n=325,498) had rheumatoid arthritis/osteoarthritis, and 54% (n=286,025) had ischemic heart disease (Table 1). Over 65% of patients who died (n=24,927) had at least one of these comorbidities. Risk factors associated with COVID-19 death that had a low prevalence in the Medicare population over 65 years of age included pressure ulcers and chronic ulcers (n=85,740), tobacco use (n=69,437), schizophrenia and other psychotic disorders (n=65,303), history of acute myocardial infarction (n=29,728), blindness and visual impairment (n=15,202), lymphoma and leukemia (n=11,996), lung cancer (n=7,578), cerebral palsy (n=3,135), and sickle cell disease (n=222). For patients over the age of 65 with no comorbidities (n=54,669), the COVID-19 infection fatality rate was 4.7% (n=2,591/54,669).

Patient characteristic
Age (years)
  Mean (SD)                                                     78.0 (9.0)
  Median (IQR)                                                  77 (70, 85)
  85 or older                                                   138,195 (25.9)
Other or unknown                                                23,935 (4.5)
Rheumatoid Arthritis / Osteoarthritis                           325,498 (61.0)
Ischemic Heart Disease                                          286,025 (53.6)
Chronic Kidney Disease                                          230,049 (43.1)
Peripheral Vascular Disease                                     196,686 (36.8)
Major Depressive Affective Disorder                             193,369 (36.2)
Fibromyalgia, Chronic Pain and Fatigue                          191,780 (35.9)
Heart Failure                                                   189,729 (35.5)
Alzheimer's Disease and Related Disorders or Senile Dementia    179,872 (33.7)
Anxiety Disorders                                               175,732 (32.9)
Acquired Hypothyroidism                                         170,906 (32.0)
Chronic Obstructive Pulmonary Disease                           164,550 (30.8)

* Top 20 comorbidities by prevalence.

Table 1: Characteristics of Medicare beneficiaries age 65 years and older diagnosed with COVID-19 from April 1 to August 31, 2020 (n=534,023).

Variable importance from random forest

The random forest model achieved a classification accuracy of 92.9%. The risk factors that contributed most to predicting COVID-19 mortality were chronic kidney disease, prostate cancer, the patient’s race, pressure ulcers and chronic ulcers, acute myocardial infarction, the patient’s sex, and heart failure. We included the 20 comorbidities with the highest variable importance in the multivariate regression.
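Variable importance can be defined in more than one way, and the text does not state which measure was used. As one common definition, the sketch below illustrates permutation importance: the drop in accuracy when a single predictor's values are shuffled. It is an assumption-laden toy, not the study's procedure: a trivial rule (`toy_model`) stands in for the fitted forest, and the data are synthetic.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    correct = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return correct / len(rows)


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # break the link between feature j and the outcome
        shuffled_rows = [
            row[:j] + (value,) + row[j + 1:] for row, value in zip(rows, column)
        ]
        importances.append(baseline - accuracy(model, shuffled_rows, labels))
    return importances


# Synthetic data: feature 0 determines the label, feature 1 is pure noise.
data_rng = random.Random(1)
rows = [(i % 2, data_rng.randint(0, 1)) for i in range(200)]
labels = [row[0] for row in rows]


def toy_model(row):
    """Stand-in for a fitted classifier: predicts the label from feature 0 only."""
    return row[0]


importances = permutation_importance(toy_model, rows, labels, n_features=2)
print(importances)
```

A predictor the model relies on loses accuracy when shuffled, whereas a predictor the model ignores has an importance of zero; ranking predictors by this drop gives an ordering analogous to the one described above.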

Factors associated with in-hospital death

In an adjusted multivariate model, COVID-19 death was associated with advanced age (85 or older vs 65-69 years, adjusted odds ratio [aOR] = 2.07, 95% CI 1.99-2.16), male sex (aOR = 1.88, 95% CI 1.82-1.94), and non-white race (Hispanic vs White, aOR = 1.74, 95% CI 1.66-1.83; Asian vs White, aOR = 1.71, 95% CI 1.61-1.82; Black vs White, aOR = 1.61, 95% CI 1.56-1.66; Other or unknown vs White, aOR = 1.44, 95% CI 1.37-1.52). The county-level random intercept showed that COVID-19 deaths were clustered by geographic location (p<0.0001). Comorbidities associated with the highest mortality risk included sickle cell disease (aOR = 1.73, 95% CI 1.21-2.47), chronic kidney disease (aOR = 1.32, 95% CI 1.29-1.36), leukemias and lymphomas (aOR = 1.22, 95% CI 1.14-1.30), and heart failure (aOR = 1.19, 95% CI 1.16-1.22), followed by diabetes (aOR = 1.18, 95% CI 1.15-1.22) and cerebral palsy (aOR = 1.18, 95% CI 1.04-1.35) (Table 2).

Patient characteristic                                          Adjusted OR (95% CI)    P value
Age (years)                                                                             <0.0001
  85 or older                                                   2.07 (1.99, 2.16)
Male sex                                                        1.88 (1.82, 1.94)       <0.0001
Race                                                                                    <0.0001
  Other or unknown                                              1.44 (1.37, 1.52)
Sickle Cell Disease                                             1.73 (1.21, 2.47)       0.002
Chronic Kidney Disease                                          1.32 (1.29, 1.36)       <0.0001
Leukemias and Lymphomas                                         1.22 (1.14, 1.30)       <0.0001
Heart Failure                                                   1.19 (1.16, 1.22)       <0.0001
Cerebral Palsy                                                  1.18 (1.04, 1.35)       0.012
Lung Cancer                                                     1.16 (1.07, 1.26)       <0.001
Acute Myocardial Infarction                                     1.15 (1.11, 1.20)       <0.0001
Pressure Ulcers and Chronic Ulcers                              1.13 (1.10, 1.16)       <0.0001
Chronic Obstructive Pulmonary Disease                           1.12 (1.09, 1.15)       <0.0001
Tobacco Use                                                     1.08 (1.04, 1.11)       <0.0001
Blindness and Visual Impairment                                 1.07 (1.01, 1.13)       0.020
Peripheral Vascular Disease                                     1.06 (1.03, 1.09)       <0.0001
Alzheimer's Disease and Related Disorders or Senile Dementia    1.06 (1.03, 1.09)       <0.0001
Schizophrenia and Other Psychotic Disorders                     1.06 (1.02, 1.09)       0.002
Ischemic Heart Disease                                          1.06 (1.02, 1.09)       <0.0001
Mobility Impairments                                            1.05 (1.01, 1.09)       0.009

*OR=Odds Ratios, 95% C.I. =95% Confidence Interval, Ref=Reference group.

Table 2: Odds ratios for COVID-19 death among 534,023 patients*.
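An odds ratio and its confidence interval can be mapped back to the log-odds scale on which a logistic model is estimated. The sketch below does this for the chronic kidney disease estimate (aOR 1.32, 95% CI 1.29-1.36). It assumes a symmetric Wald-type interval on the log scale, which the text does not confirm, and the inputs are rounded published values, so the recovered standard error and p-value are approximations rather than the model's actual output.

```python
import math


def log_odds_summary(odds_ratio, ci_lower, ci_upper, z_crit=1.96):
    """Approximate coefficient, standard error, z statistic and two-sided p-value.

    Assumes the interval is exp(beta +/- z_crit * se), i.e. a Wald interval.
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z_crit)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return beta, se, z, p_value


beta, se, z, p_value = log_odds_summary(1.32, 1.29, 1.36)
print(beta, se, z, p_value)
```

A narrow interval such as this one implies a small standard error and therefore a very small p-value, consistent with the "<0.0001" reported for chronic kidney disease.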

Construction of a risk calculator

We used the coefficients of patients’ age, sex, race, and all reported risk factors to create a personalized COVID-19 mortality risk prediction calculator. Based on this calculator, an 80-year-old Hispanic man with chronic kidney disease has a mortality risk 6 times higher than a 66-year-old white woman with no comorbidities. The risk calculator is available at www.
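In a logistic model without interaction terms, the odds ratios for separate risk factors multiply when the factors are combined, because their coefficients add on the log-odds scale. The sketch below combines three adjusted odds ratios reported in this article (Hispanic vs White, male sex, and chronic kidney disease). It is a partial illustration only: the age-group term for the profile is left out because that estimate does not appear in the text here, the absence of interaction terms is an assumption, and an odds ratio is not the same quantity as a ratio of risks.

```python
import math

# Adjusted odds ratios as reported in the article.
reported_aor = {
    "hispanic_vs_white": 1.74,
    "male_sex": 1.88,
    "chronic_kidney_disease": 1.32,
}


def combined_odds_ratio(odds_ratios):
    """Sum the log odds ratios (the coefficients), then exponentiate."""
    return math.exp(sum(math.log(value) for value in odds_ratios))


# Partial combined odds ratio, excluding the age-group term.
partial_or = combined_odds_ratio(reported_aor.values())
print(round(partial_or, 2))
```

Multiplying in the appropriate age-group odds ratio would complete the comparison between the two profiles on the odds scale.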


Our study is the largest comorbidity analysis of COVID-19 patients in the U.S. to date. Using a large sample size, geographic clustering, and the broad pool of independent variables captured in this study, we found that the comorbidities of sickle cell disease, chronic kidney disease, leukemias and lymphomas, heart failure, and diabetes are all associated with higher rates of COVID-19 death. These results identify high-risk individuals who should be considered for prioritization of vaccine and therapeutic medication allocation. These findings were also used to develop a risk calculator to allow clinicians to identify patients who are at the highest risk of COVID-19 mortality.

Our results expand on findings reported in prior studies describing COVID-19 mortality risk factors [8-10]. A large UK-based cohort study of 10,926 COVID-19 deaths found that increased age, male sex, and Black race were associated with a higher risk of mortality [11]. Our results showed that with every five-year increase over age 65 years, there was an approximately 20% increase in mortality. Consistent with prior reports, we also found that racial and ethnic minority groups had a higher risk of COVID-19 mortality compared with White patients. A study in New York of 4,312 COVID-19 patients observed that Black patients had the highest relative risk compared to other race groups [12]. Other analyses have found Black patients to have the greatest risk of COVID-19 death; however, we found that Hispanic patients had the highest mortality risk and that Black patients were a close second in this Medicare population [12-14]. The disproportionate impact of COVID-19 on minorities may be related to three possibilities. First, there may be a disproportionate burden of undiagnosed comorbidities such as diabetes, HIV, liver disease, cardiovascular disease, asthma, and kidney disease in minority communities [15]. Second, minority patients may be acquiring the infection with a higher viral load given the denser settings in which minority populations live, commute, and work. The size of a patient’s viral load has been associated with the infection fatality rate [16,17]. These populations may also be more likely to live in multigenerational households where ventilation may be poor and effective social distancing may not be feasible [18]. Finally, minorities may have poorer access to quality health care [12-14].

Another U.K. study of 20,133 patients hospitalized with COVID-19 observed that death was associated with the pre-existing comorbidities of chronic cardiac disease, chronic non-asthmatic pulmonary disease, chronic kidney disease, obesity, and liver disease [19]. Similarly, a U.S. cohort study of 11,721 hospitalized patients with COVID-19 in 38 states found the comorbidities of chronic kidney disease and cardiovascular disease to be associated with increased odds of mortality [20]. A U.S.-based study of 521 patients with chronic kidney disease who became critically ill with COVID-19 described a hazard ratio of 1.25 [21]. A more recent study, based on a systematic review, reported a hazard ratio of 3.61 for chronic kidney disease [22]. Another meta-analysis showed that chronic kidney disease is the second leading risk factor, with an odds ratio of 3.6 [23]. Our study strongly reinforces the hypothesis that chronic kidney disease is a major risk factor for COVID-19 death. We observed that chronic kidney disease was the second leading risk factor among all comorbidities, with an adjusted odds ratio of 1.32, second only to sickle cell disease, which is rare in the Medicare population over age 65 years.

Using a large sample size, our study affirmed the association of hypertension, diabetes, chronic obstructive pulmonary disease, cardiovascular disease, obesity, and lung cancer with COVID-19 death, in line with previous large comorbidity studies [10-12,24-29]. However, unlike prior studies, our sample revealed sickle cell disease and leukemias and lymphomas to be major risk factors for COVID-19 death. Previous case studies have reported that sickle cell disease patients had favorable outcomes and suggested that mortality risk in this population was inconclusive [30,31]. One explanation for our finding may be that sickle cell disease is associated with impaired oxygen exchange, which may be further impeded during the inflammatory phase of the infection.

Obesity has been well described as a risk factor for COVID-19 death [11,19,20]. One study showed that COVID-19 patients with a BMI of 30-34 kg/m2 and >35 kg/m2 were 1.8 and 3.6 times more likely to require critical care, respectively, when compared to individuals with a BMI of <30 kg/m2 [32]. Another study showed a hazard ratio of 1.6 among obese patients [22]. We observed a risk lower than described in other studies. We believe the weaker association we report is due to the well-known low sensitivity of obesity codes in claims data for detecting obese and overweight individuals [33]. Therefore, we believe the impact of obesity, overweight, and undocumented metabolic syndrome or pre-diabetes on poor COVID-19 outcomes to be under-reported in this claims-based study.

Some of the risk factors we identified may be collinear with institutionalization. Specifically, cerebral palsy, chronic ulcers, blindness, Alzheimer’s disease and related dementias, and mobility impairments are likely more prevalent in persons living with assisted or nursing care. Higher risk in these populations may be due to the mode of transmission in residential facilities.

We report the first COVID-19 risk factor analysis using machine learning algorithms. Various studies have established the random forest approach as a useful method for modelling risk prediction [34-36]. We incorporated a random forest approach into the COVID-19 risk factor exploration to identify variables that could improve the predictive model. Our results revealed that 60% (n=12) of the comorbidities captured by the random forest model were also among the 20 comorbidities with the highest odds ratios of COVID-19 death. Overall, random forest as a machine learning approach tended to select important variables that could improve model fit [37].

This study had some important limitations. First, we were unable to make conclusions about the infection fatality rate in the community because claims data are specific but not sensitive for infection with COVID-19. Second, we only included inpatient hospital deaths since there is a time lag in the availability of death data outside the hospital setting in the Medicare dataset.


As COVID-19 vaccines and therapeutics become available, the risk factors we report could inform the allocation of these limited resources until they become more widely available. The application of these results to the personalized risk calculator may help educate clinicians and the public on which patients aged 65 years or older are at highest risk of COVID-19 mortality.


We thank West Health Institute for their research support on this study. We also thank Dr. Brian W. Weir and Dominique Vervoort for their help in preparing this manuscript.

Author Contributions

Dr. Martin A. Makary had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Study concept and design: Makary, Dun, Lash, Segev, Walsh, Bae. Acquisition, analysis, and interpretation of data: Makary, Dun, Segev, Walsh, Toner, Bae. Drafting of the manuscript: Dun, Makary, Walsh. Critical revision of the manuscript for important intellectual content: Makary, Walsh, Segev, Adalja, Toner. Statistical analysis: Dun, Bae. Obtained funding: Dun, Walsh, Makary. Administrative, technical, or material support: Dun, Walsh, Bae, Adalja, Toner, Lash, Hashim, Paturzo, Segev, Makary. Study supervision: Makary.

Conflict of Interest Disclosures

None reported.


This research was supported by a grant from the Gary and Mary West Health Institute.