Comparison of the effects of discretization methods for continuous explanatory variables in logistic regression
He Xianying
National Engineering Laboratory for Internet Medical Systems and Applications, the First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, China
Abstract Objective To compare the advantages and disadvantages of different data discretization methods when the relationship between a continuous independent variable and logit π is not linear, and to provide a reference for discretizing continuous independent variables in logistic regression analysis. Methods Based on a case-control study design, simulated data were generated with R software, varying three factors: effect size, number of independent variables, and sample size. The continuous independent variables were then discretized by different methods, logistic regression models were fitted, and the fitting performance of the methods was compared. Results Across the simulated data sets, among the four methods of handling continuous variables used to fit the logistic regression model, the "maximum OR values method" was better at screening out meaningful influencing factors. Its model fit was also the best, with the smallest AIC, the largest Nagelkerke R², and a higher overall correct classification rate. Conclusion The "maximum OR values method" is recommended for discretizing continuous variables when their relationship with logit π is non-monotonic.
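The workflow summarized in the abstract (simulate data, discretize a continuous predictor, fit a logistic model, compare AIC and Nagelkerke R²) can be illustrated with a minimal sketch. The abstract does not define the "maximum OR values method", so the sketch assumes it means scanning candidate cutpoints and keeping the one whose odds ratio is farthest from 1. The U-shaped logit, the 10% to 90% candidate grid, the 0.5 continuity correction, and all function names are hypothetical choices for illustration, and Python is used here although the study itself used R.

```python
import math
import random


def simulate(n, seed=0):
    """Simulate (x, y) pairs whose logit is a hypothetical U-shaped function of x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        logit = -1.5 + 8.0 * (x - 0.35) ** 2  # assumed shape, not taken from the paper
        p = 1.0 / (1.0 + math.exp(-logit))
        data.append((x, 1 if rng.random() < p else 0))
    return data


def quantile_cut(sorted_x, q):
    """Value at quantile q of an already sorted list (simple index rule)."""
    return sorted_x[int(q * (len(sorted_x) - 1))]


def xlogy(k, total):
    """k * log(k / total), with the convention 0 * log(0) = 0."""
    return k * math.log(k / total) if k > 0 else 0.0


def fit_summary(data, cut):
    """Fit statistics for logistic regression on the binary predictor z = (x > cut).

    With one binary predictor the maximum-likelihood fit reproduces the two
    group proportions, so the log-likelihood follows from the 2x2 table.
    """
    a = sum(1 for x, y in data if x > cut and y == 1)   # z = 1, cases
    b = sum(1 for x, y in data if x > cut and y == 0)   # z = 1, controls
    c = sum(1 for x, y in data if x <= cut and y == 1)  # z = 0, cases
    d = sum(1 for x, y in data if x <= cut and y == 0)  # z = 0, controls
    n = a + b + c + d
    # Odds ratio with a 0.5 continuity correction so empty cells do not break it.
    odds_ratio = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))
    ll_model = xlogy(a, a + b) + xlogy(b, a + b) + xlogy(c, c + d) + xlogy(d, c + d)
    ll_null = xlogy(a + c, n) + xlogy(b + d, n)
    aic = -2.0 * ll_model + 2 * 2  # two parameters: intercept and slope
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_null / n))
    return {"cut": cut, "or": odds_ratio, "aic": aic, "r2": nagelkerke}


def scan_cutpoints(data):
    """Return the fit for the candidate cutpoint whose odds ratio is farthest from 1."""
    xs = sorted(x for x, _ in data)
    candidates = [quantile_cut(xs, k / 20) for k in range(2, 19)]  # 10% to 90%
    fits = [fit_summary(data, cut) for cut in candidates]
    return max(fits, key=lambda f: abs(math.log(f["or"])))


if __name__ == "__main__":
    data = simulate(500, seed=1)
    xs = sorted(x for x, _ in data)
    print("median split:", fit_summary(data, quantile_cut(xs, 0.5)))
    print("max-OR split:", scan_cutpoints(data))
```

Because the cutpoint is selected from the same data used to estimate the odds ratio, the selected OR tends to be optimistic; the sketch applies no correction for that and is meant only to show how the compared quantities are computed.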
Received: 14 May 2022