Xiang Rumei, Wei Xing, Dai Wei, Zhang Lijun, Xu Wei, Tian Jie, Zhang Hongwei, Sun Jiaxin, Shi Qiuling
Objective Accurate and standardised data are the foundation of reliable research findings. Taking lung surgery as a case study, we analysed the data characteristics of an anaesthesia information system and carried out preprocessing, encompassing cleaning, conversion, integration and imputation, to build a research-ready dataset.
Methods Data on patients who underwent lung surgery at Sichuan Cancer Hospital between April 2021 and November 2022 were collected from the anaesthesia information system. The characteristics of the source data were analysed, and Python and SAS were used for data preprocessing. Text data were transformed into numerical values to facilitate data mining, using Python's split operations, SAS macros, and functions. Missing values were filled, and anomalies, inconsistencies, and redundant data were corrected through data cleaning and data reduction. Data integration was achieved through NOUNIQUEKEY, SQL and LAG statements to expand the data volume.
Results Two Excel sheets were extracted from the anaesthesia information system and the hospital information system, comprising a total of 1 835 anaesthesia records and 46 612 medical records. Analysis of the source data revealed that the anaesthesia information system contained idiosyncratic medical terminology, varied semantic expressions, multiple outlines for identical drugs, and certain drug names ending in "alternate". Based on these data characteristics and the semi-structured data format, we wrote three macros to clean and validate all drug names, standardise medical terminology, and unify outlines. This process yielded 12 pre-anaesthesia drugs, 24 intraoperative drugs, and 12 analgesic pump drugs. Missing data were then completed in a second pass, noise was reduced, and inconsistent data were cleaned. Forty-eight anaesthesia records (2.62%) for non-pulmonary surgery were excluded, and 10 fields irrelevant to the mining task were removed.
After data integration, 1 748 anaesthesia records (97.82%) were matched with medical prescription data. After the preprocessing described above, the final structured dataset comprised 1 748 patients and 99 variables.
Conclusion The anaesthesia data preprocessing workflow developed from the analysis of source data accomplishes data cleaning, data integration, data transformation and data reduction, and yields standardised and precise drug data. It offers a methodological reference for cleaning and structuring anaesthesia information at other institutions and provides a reliable data foundation for research that requires high-quality anaesthesia medication data, which should support deeper and more advanced research in this area.
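
The text-to-numeric conversion, drug name standardisation and record matching summarised above can be illustrated with a minimal Python sketch. It is not the study's code: the field layout (items of the form `drug dose mg` separated by semicolons), the synonym table `DRUG_SYNONYMS`, the field names `patient_id`, `intraop_drugs` and `n_orders`, and the helper functions are all hypothetical, chosen only to show the general technique.

```python
import re

# Hypothetical synonym table (not from the study): maps each variant
# spelling or short form of a drug name to one standard name.
DRUG_SYNONYMS = {
    "propofol": "propofol",
    "prop": "propofol",
    "sufentanil": "sufentanil",
    "sufenta": "sufentanil",
}

# Assumed item layout: "<drug name> <dose> mg", e.g. "Propofol 100 mg".
ITEM_PATTERN = re.compile(
    r"^(?P<name>.+?)\s+(?P<dose>\d+(?:\.\d+)?)\s*mg$", re.IGNORECASE
)


def standardise_drug(raw):
    """Lower-case a drug name, drop a trailing 'alternate', map synonyms."""
    name = raw.strip().lower()
    name = re.sub(r"\s*alternate$", "", name)
    return DRUG_SYNONYMS.get(name, name)


def parse_drug_field(text):
    """Turn 'drug dose mg; drug dose mg' into {standard name: total dose}."""
    doses = {}
    for item in text.split(";"):
        match = ITEM_PATTERN.match(item.strip())
        if match is None:
            continue  # empty or unparseable item: left for manual review
        drug = standardise_drug(match.group("name"))
        doses[drug] = doses.get(drug, 0.0) + float(match.group("dose"))
    return doses


def integrate(anaesthesia_rows, orders_by_patient):
    """Keep only anaesthesia records that match a prescription record."""
    merged = []
    for row in anaesthesia_rows:
        n_orders = orders_by_patient.get(row["patient_id"])
        if n_orders is None:
            continue  # no matching prescription data
        record = {"patient_id": row["patient_id"], "n_orders": n_orders}
        record.update(parse_drug_field(row["intraop_drugs"]))
        merged.append(record)
    return merged


if __name__ == "__main__":
    # Invented example rows, for illustration only.
    rows = [
        {"patient_id": 1, "intraop_drugs": "Propofol alternate 100 mg; prop 50mg"},
        {"patient_id": 2, "intraop_drugs": "Sufenta 0.03 mg"},
        {"patient_id": 3, "intraop_drugs": ""},
    ]
    for record in integrate(rows, {1: 12, 2: 7}):
        print(record)
```

Summing doses per standardised name gives one numeric variable per drug, which is one plausible route to a wide, analysis-ready table. Records with no prescription match are dropped here, mirroring the matching step in the Results, while unparseable items are skipped without being guessed at.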