FTIR fingerprint — testing a new representation of the binary fingerprint based on FTIR spectra in the prediction of physicochemical properties

The paper deals with the development of a new method for the generation of binary fingerprints based on the Savitzky-Golay (SG) algorithm and first-order derivatives of FTIR spectra, which are then used to create prediction models for selected physicochemical properties of chemical compounds. Models based on the FEDS (Functionally-Enhanced Derivative Spectroscopy) transformation and raw spectra were used as a reference to determine whether the use of the SG filter and first-order derivatives was worth developing further. The FTIR spectra of 103 compounds with theoretically determined values of logP, logD and logS were studied. The Tanimoto coefficient and correlation coefficient were used to compare the fingerprints obtained, while the root mean square error (RMSE) was used to assess the quality of the prediction models. Based on the results, it was found that the use of the SG filter and derivatives had a positive effect on the quality of the prediction models for logP and logS, and a negative effect on the quality of the models for logD, compared to the models based on original spectra and FEDS transformation.


Introduction
In recent years, chemistry has evolved from a field based solely on direct work with chemical substances and the use of instrumental methods into a science that makes use of the latest advances in mathematics, computer science, and many other branches of science. By using appropriate mathematical calculations and the computational power of modern computers, the process of making new substances in the pharmaceutical, food or chemical industries can be simplified. The combination of well-known instrumental methods with suitable (in silico) mathematical methods offers almost unlimited possibilities [1].
A fingerprint is one of many methods of representing chemical compounds, in this case in the form of a sequence of bits [2,3], where zeros indicate the absence of a given property (e.g. a substructure) and ones indicate its presence. Molecular fingerprints are generally not a unified way of representing a chemical compound (unlike SMILES or SMARTS codes), and to date many different ways of generating fingerprints have been reported [4,5]. One of the most important applications of molecular fingerprints is the creation of mathematical models to predict physicochemical or biological properties. The development of a relatively reliable model has the potential to simplify the process of developing new therapeutic agents, for example, by predicting some of the properties of a compound without the need for lengthy and expensive experimental studies.
One of the biggest obstacles in analyzing a digital signal (e.g. a spectrum or an electrocardiogram) is noise. Noise not only makes analysis difficult, but it also affects the visual aspects of presenting the results. In the case of spectra, it can appear as fluctuations in absorbance (or transmittance). Noise is a type of signal distortion caused by, among other things, errors in the detector itself [6]. For these reasons, it is not possible to eliminate noise from a digital signal by, for example, calibrating the instrument or taking many measurements. One method of eliminating noise is to filter the received signal using an appropriate numerical algorithm. One of the most popular digital filters is the Savitzky-Golay (SG) filter [7]. Due to its simplicity and efficiency, it has found wide application in analytical chemistry and chemometrics. It was popularized by Abraham Savitzky and Marcel J.E. Golay in 1964, when they published tables of convolution coefficients for various polynomials and subset sizes [8]. Thanks to the popularity of the Savitzky-Golay filter, the original paper was recognized by the journal Analytical Chemistry as the fifth-best paper published in the journal [9].
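As an illustration of the bit-sequence representation described above, the following minimal Python sketch stores two fingerprints as lists of 0/1 values and compares them with the Tanimoto coefficient, the similarity measure used later in this paper. The function name `tanimoto` and the two example fingerprints are hypothetical and are not taken from the data set; the handling of two all-zero fingerprints is an assumed convention.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Bits set in both fingerprints and bits set in at least one of them.
    both = sum(1 for a, b in zip(fp_a, fp_b) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a == 1 or b == 1)
    # Assumed convention: two all-zero fingerprints count as identical.
    return 1.0 if either == 0 else both / either


# Hypothetical 8-bit fingerprints (illustrative values only).
fp_1 = [1, 0, 1, 1, 0, 0, 1, 0]
fp_2 = [1, 0, 0, 1, 0, 1, 1, 0]
similarity = tanimoto(fp_1, fp_2)
```

Here three bits are shared and five bits are set in at least one fingerprint, so the coefficient is 3/5.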
The basis of the Savitzky-Golay filter is convolution: a low-degree polynomial is fitted by the least-squares method to a given point and a predetermined number of neighboring points (the window width) [10]. In practice, this means that the SG filter extracts a small subset from the data set, consisting of the point under study and a predetermined number of neighboring points on either side of it. A polynomial of a pre-selected degree is then fitted to this subset (the width of the window therefore determines how strongly the signal is smoothed). The SG filter repeats these steps for each point in the set.
The use of derivatives as a tool for the analysis of spectra dates back to the 1950s. However, it was not until the 1970s that their development and popularity increased [11]. They offer a great deal of convenience in the analysis of spectra, including the ability to easily locate significant parts of the absorption bands. These can include the locations of peaks, inflection points and absorption band edges.
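The moving-window fit described above can be sketched in a few lines of Python. This is a minimal illustration, not the R code used in this work: it assumes equally spaced points and a quadratic polynomial, for which the convolution weights have a simple closed form, and it returns values only for points that have a full window around them. The names `sg_coefficients` and `sg_filter` are hypothetical.

```python
def sg_coefficients(half_width, derivative=0):
    """Convolution weights of a quadratic Savitzky-Golay fit.

    The window holds 2 * half_width + 1 equally spaced points.
    """
    m = half_width
    offsets = range(-m, m + 1)
    if derivative == 0:
        norm = (2 * m + 3) * (2 * m + 1) * (2 * m - 1)
        return [3 * (3 * m * m + 3 * m - 1 - 5 * i * i) / norm for i in offsets]
    if derivative == 1:
        norm = m * (m + 1) * (2 * m + 1) / 3  # sum of i**2 over the window
        return [i / norm for i in offsets]
    raise ValueError("this sketch supports derivative 0 or 1 only")


def sg_filter(values, half_width, derivative=0, spacing=1.0):
    """Apply the weights at every point that has a complete window."""
    m = half_width
    weights = sg_coefficients(m, derivative)
    scale = spacing ** derivative
    result = []
    for center in range(m, len(values) - m):
        window = values[center - m:center + m + 1]
        result.append(sum(w * v for w, v in zip(weights, window)) / scale)
    return result


# A quadratic signal is reproduced exactly by a quadratic fit.
signal = [float(x * x) for x in range(7)]
smoothed = sg_filter(signal, half_width=2)               # values at x = 2, 3, 4
slope = sg_filter(signal, half_width=2, derivative=1)    # 2x at x = 2, 3, 4
```

For a five-point window the smoothing weights reduce to the familiar (-3, 12, 17, 12, -3)/35, and the first-derivative weights to (-2, -1, 0, 1, 2)/10.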
One of the major obstacles in the analysis of infrared spectra is the spectral band overlap (SBO) phenomenon [12,13]. The reason for this phenomenon is the occurrence of absorption bands of different chemical bonds in the same or similar wavenumber intervals. It is particularly noticeable for bonds between the same atoms in different groups of compounds (for example, the N-H bond in amides and amines). There are methods to minimize the effect of SBO, including:
• increasing resolution and minimizing noise;
• modifications to the test sample (e.g. change of solvent);
• mathematical transformations, such as derivatives (the most commonly used are second-order derivatives) [14] and deconvolution [15].
One method of FTIR spectra transformation is Functionally-Enhanced Derivative Spectroscopy (FEDS). The main objective of FEDS is to separate the bands and simplify the spectrum by narrowing the individual bands without significantly changing their position. This is achieved by creating a P-function from a series of simple functions [16].
(1)
where: P_i is the P-function for the i-th point and A_i is the absorbance for the i-th point.

Data set
The FTIR spectra of 103 compounds (the full list is in Appendix 1) were obtained in our previous work [17,18] and were used in this research. The results for the methods examined in that work were also used for comparative purposes.
A Nicolet™ iS™ 5 FTIR spectrophotometer was used to measure the spectra. Each spectrum was obtained by averaging 16 scans taken at a resolution of 2 cm⁻¹ over the range of 4000-650 cm⁻¹. The measurements were recorded using OMNIC software, while further processing (smoothing and differentiation of the spectrum) was performed using RStudio. Smoothing and differentiation were performed with custom scripts using the R libraries prospectr and rootSolve. Predictive models were developed using the KNIME (Konstanz Information Miner) environment (Appendix 3). Theoretical and experimental physicochemical properties used for the research were fetched from the chemspider.com website (accessed April 7, 2022).

FTIR fingerprint algorithm
The algorithm of FTIR-based molecular fingerprint generation was proposed in our previous work [17,18]. Herein, we tested the performance of this algorithm by adding a preprocessing step for the original FTIR spectra. Thus, the generation of the binary fingerprints in the present work was performed according to the following protocol:
1. Preprocessing of the FTIR spectra (smoothing and differentiation).
2. Locating the roots of the derivative spectrum (the roots correspond to the locations of the peaks in the spectrum).
3. Matching of the located root positions to pre-selected wavenumber intervals corresponding to peak locations for individual absorption bands.
4. Generation of the spectral fingerprint (f_s). If the location (i.e. wavenumber) of a root falls within any of the wavenumber intervals, the bit corresponding to that wavenumber interval has the value of '1'.
5. Comparison of the spectral fingerprint with a molecular fingerprint (f_m), which is based on the molecular structure of the compound. For example, if the compound is an alcohol, the C-O bond will result in the presence of a bit at positions 80 and 83 in the molecular fingerprint. A table listing our definition of the substructures for each position in f_m is in Appendix 2.
6. Generation of the final fingerprint. The agreement of a given bit in both fingerprints results in the presence of a bit ('1') in the final fingerprint, while the absence of a bit in one or both fingerprints in a given position results in the absence of a bit ('0') in the final fingerprint.
Based on the obtained fingerprints and the known values of logP, logS, and logD, the predictive models were built using linear regression and the regression tree algorithm. A genetic feature (variable) preselection strategy with 1000 iterations and a population size of 50 was used for both the regression tree and the linear regression algorithm. In addition, the data for linear regression were pre-processed using PCA (Principal Component Analysis) with a target dimension of 50. The quality of a given prediction model was assessed by comparing the true value of a given physicochemical parameter with its predicted value, using the root mean square error (RMSE).
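Steps 2 to 6 of the protocol above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the R implementation used in this work: roots are taken as sign changes of the derivative with linear interpolation between neighboring points, the wavenumber intervals and the molecular fingerprint are hypothetical three-bit examples, and the function names are invented for the sketch.

```python
def find_roots(wavenumbers, derivative):
    """Wavenumbers at which the derivative spectrum crosses zero."""
    roots = []
    for i in range(len(derivative) - 1):
        d0, d1 = derivative[i], derivative[i + 1]
        if d0 == 0.0:
            roots.append(wavenumbers[i])
        elif d0 * d1 < 0.0:
            # Linear interpolation between the two bracketing points.
            frac = d0 / (d0 - d1)
            step = wavenumbers[i + 1] - wavenumbers[i]
            roots.append(wavenumbers[i] + frac * step)
    return roots


def spectral_fingerprint(roots, intervals):
    """Bit is 1 if any root falls inside the corresponding interval."""
    return [1 if any(lo <= r <= hi for r in roots) else 0
            for lo, hi in intervals]


def final_fingerprint(f_s, f_m):
    """Bit is 1 only where both fingerprints have a bit set."""
    return [s & m for s, m in zip(f_s, f_m)]


# Illustrative derivative spectrum (not measured data).
wavenumbers = [1000.0, 1010.0, 1020.0, 1030.0, 1040.0, 1050.0]
derivative = [0.4, 0.2, -0.2, -0.1, 0.1, 0.3]
# Hypothetical wavenumber intervals and molecular fingerprint.
intervals = [(1000.0, 1020.0), (1025.0, 1030.0), (1031.0, 1045.0)]
f_m = [1, 1, 0]

roots = find_roots(wavenumbers, derivative)
f_s = spectral_fingerprint(roots, intervals)
f_final = final_fingerprint(f_s, f_m)
```

In this toy case the roots lie near 1015 and 1035 cm⁻¹, so the first and third bits of f_s are set, and only the first bit survives the comparison with f_m.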

Basic methods of FTIR spectra preprocessing
Firstly, the FTIR fingerprints were investigated based on raw experimental spectra transformed by a first-order derivative without any pre-processing. However, this approach renders the use of the spectra meaningless, as all the noise and interference present in the spectrum are interpreted as absorption bands. The spectral fingerprint obtained in this way was characterized by a high and unjustified number of bits set to '1', since the number of roots in the derivative spectrum is also high (as shown in Figure 2, the differentiation of the raw spectrum resulted in 293 roots).
Based on these results, it can be concluded that the use of a suitable filter is necessary to eliminate as much noise as possible. In the present study, it was decided to use an SG filter because of the nature of the noise present in the experimental spectra. If the spectrum is measured correctly, the noise that can be observed consists of local fluctuations in absorbance that tend to accumulate at the end of the spectrum and in areas where there are no absorption bands. During the research, it was concluded that direct application of a single SG filter across the whole spectrum is problematic due to the different densities of absorption bands in different areas of the spectrum. In general, FTIR spectra can be divided into two regions: the fingerprint region, which has no specific conventional limit but is considered in this article to be between 650 and 1500 cm⁻¹, and the functional group region, which is between 1500 and 4000 cm⁻¹ [19]. The fingerprint region is characterized by a dense packing of absorption bands [20], so the application of a strong filter can cause separate absorption bands to merge. However, for the functional group region, which is characterized by a sparser packing of absorption bands and a higher presence of noise, a relatively stronger filter would be preferable. To avoid merging bands when processing the spectra, the two regions were processed separately with different window widths.
After a thorough visual analysis of the spectra, as well as an examination of the influence of the window width on the number of detected roots in the fingerprint and the functional group regions, it was concluded that the best smoothing effect with the SG filter is achieved for a window width of 200 in the fingerprint region and 400 in the functional group region.
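The region-dependent smoothing described above can be illustrated with the following Python sketch. It is a simplification under several assumptions: a quadratic SG fit with closed-form weights, equally spaced points, a window chosen by the region of the center point (rather than physically splitting the spectrum), raw values kept near the spectrum edges, and small illustrative half-widths instead of the window widths of 200 and 400 used in this work. The names are hypothetical.

```python
FINGERPRINT_LIMIT = 1500.0  # cm^-1, region boundary used in this article


def quadratic_sg_weights(m):
    """Smoothing weights of a quadratic SG fit, window of 2 * m + 1 points."""
    norm = (2 * m + 3) * (2 * m + 1) * (2 * m - 1)
    return [3 * (3 * m * m + 3 * m - 1 - 5 * i * i) / norm
            for i in range(-m, m + 1)]


def smooth_by_region(wavenumbers, absorbance, m_fingerprint, m_functional):
    """Smooth with a narrower window below the limit, a wider one above."""
    weights = {m: quadratic_sg_weights(m)
               for m in (m_fingerprint, m_functional)}
    smoothed = list(absorbance)
    n = len(absorbance)
    for i, wn in enumerate(wavenumbers):
        m = m_fingerprint if wn < FINGERPRINT_LIMIT else m_functional
        if i - m < 0 or i + m >= n:
            continue  # incomplete window at the edge: keep the raw value
        window = absorbance[i - m:i + m + 1]
        smoothed[i] = sum(w * v for w, v in zip(weights[m], window))
    return smoothed


# Synthetic spectrum: flat line with one isolated spike in each region.
wavenumbers = [650.0 + 10.0 * k for k in range(336)]  # 650 to 4000 cm^-1
absorbance = [0.0] * len(wavenumbers)
absorbance[40] = 1.0    # 1050 cm^-1, fingerprint region
absorbance[200] = 1.0   # 2650 cm^-1, functional group region
result = smooth_by_region(wavenumbers, absorbance,
                          m_fingerprint=2, m_functional=4)
```

The wider window in the functional group region attenuates the isolated spike more strongly (to about 0.26) than the narrower window in the fingerprint region (to about 0.49), which mirrors the reasoning for using different window widths in the two regions.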

Advanced methods of FTIR spectra preprocessing
Secondly, four methods of processing the FTIR spectra were developed, and for each method two cut-off levels were used to generate spectral fingerprints. The selection criterion was to make the cut-offs as representative as possible, e.g. if both the higher and lower cut-offs did not significantly affect the number of bits in the final fingerprint or were similar in terms of the Tanimoto coefficient, both were omitted:
a) Baseline (BS): all absorbance values below a given cut-off are removed (Figure 4). The cut-offs used were 0.15 and 0.10;
b) Pre-Smoothing Fragmentation (PrSF): the spectrum is split into fragments, which are smoothed either with the regular SG filter (window width = 400) or, if the amplitude of the values in a given fragment is less than the cut-off, with a far stronger SG filter (window width = 1500) (Figure 5). The cut-offs used were 0.15 and 0.25;
c) Post-Smoothing Fragmentation (PoSF): the smoothed spectrum is split into fragments, and a given fragment is either left unchanged or, if the amplitude of the values in the fragment is less than the cut-off, all absorbance values in the fragment are set to 0 (Figure 6). The cut-offs used were 0.15 and 0.25;
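The BS and PoSF rules from the list above can be sketched in Python as follows. Several details are assumptions made only for this illustration: "removed" values are set to zero, fragments have a fixed length, the amplitude of a fragment is taken as the difference between its maximum and minimum, and the example spectrum is invented. The function names are hypothetical.

```python
def baseline_cutoff(absorbance, cutoff):
    """BS: absorbance values below the cut-off are set to zero (assumed)."""
    return [a if a >= cutoff else 0.0 for a in absorbance]


def post_smoothing_fragmentation(absorbance, cutoff, fragment_length):
    """PoSF: zero every fragment whose amplitude is below the cut-off."""
    result = []
    for start in range(0, len(absorbance), fragment_length):
        fragment = absorbance[start:start + fragment_length]
        amplitude = max(fragment) - min(fragment)  # assumed definition
        if amplitude >= cutoff:
            result.extend(fragment)
        else:
            result.extend([0.0] * len(fragment))
    return result


# Illustrative smoothed spectrum (not measured data).
spectrum = [0.02, 0.05, 0.03, 0.04, 0.10, 0.40, 0.35, 0.12]
bs = baseline_cutoff(spectrum, cutoff=0.15)
posf = post_smoothing_fragmentation(spectrum, cutoff=0.15, fragment_length=4)
```

In this example BS keeps only the two values above 0.15, whereas PoSF zeroes the flat first fragment and keeps the second fragment intact, including its low-absorbance band edges.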

Prediction models evaluation
Based on the above-described methods, final fingerprints were generated according to the algorithm presented in the previous section and then used to create models to predict physicochemical properties.Predictive models were based on linear regression and a regression tree algorithm.
During the study, it was observed that in almost all cases the models based on the regression tree were more accurate in predicting physicochemical properties. As can be seen in Figure 8, the prediction models for logP and logS obtained by the regression tree method show better quality than those obtained by linear regression (except for PoSF with a cut-off of 0.15, which showed a 2.45% worse result). The prediction models for logD showed a significantly higher degree of similarity between the results of the two methods, but the regression tree still proved to be the better method (by 6.11% on average). Therefore, the results obtained using the regression tree method are used to compare the methods themselves.
As can be seen in Table 1, it is not possible to identify a single best method for processing spectra to generate fingerprints in the context of creating predictive models. To compare the methods, the deviation from the lowest error in the group was determined for each RMSE value obtained. The mean of the results for each method and cut-off was then calculated. It can be seen that fingerprints obtained by the baseline (BS) and derivative fragmentation (DF) methods showed the best relative performance. In the case of fingerprints obtained using the baseline method (BS), the RMSE values showed an average deviation from the best result of 23.53% (with a deviation for a cut-off of 0.15 of only 14.41%). For the derivative fragmentation (DF) method, the same parameter was 29.69%. In contrast, the PrSF method consistently showed some of the worst results among all the physicochemical properties tested, with an average deviation of 54.90%. The results for the PoSF method proved highly variable. Compared to the methods developed in our previous paper, the methods discussed here showed lower RMSE values for the prediction models of logP and logS. This was particularly the case for logS, where all methods and cut-offs tested, except the PrSF method, showed lower RMSE values. For logD, the methods discussed in this paper showed slightly higher RMSE values.
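The comparison described above rests on two simple calculations, sketched here in Python. The RMSE follows its standard definition; the percentage deviation is assumed to be taken relative to the lowest RMSE in the group, which is one plausible reading of the procedure. The method labels and all numerical values are hypothetical placeholders, not results from Table 1.

```python
import math


def rmse(true_values, predicted_values):
    """Root mean square error between true and predicted values."""
    if len(true_values) != len(predicted_values):
        raise ValueError("inputs must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared) / len(squared))


def percent_deviation_from_best(rmse_by_method):
    """Deviation of each RMSE from the lowest one, in percent (assumed)."""
    best = min(rmse_by_method.values())
    return {name: 100.0 * (value - best) / best
            for name, value in rmse_by_method.items()}


# Hypothetical values for illustration only.
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
group = {"method_a": 0.50, "method_b": 0.60, "method_c": 0.75}
deviations = percent_deviation_from_best(group)
```

Averaging such deviations over the three properties for each method and cut-off would give figures of the kind quoted in the text.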

Conclusions
This article discusses the use of the Savitzky-Golay filter and derivatives in the creation of a new type of molecular representation based on molecular structure and FTIR spectra. This is a particularly important topic due to the growing interest in in silico methods in the pharmaceutical, cosmetic, and agrochemical industries, among others. The use of in silico methods can significantly reduce the time and cost of developing new compounds with desired bioactive properties.
In summary, the use of the Savitzky-Golay filter and derivatives had a positive effect on the quality of the resulting prediction models compared to the FEDS transformation for all of the physicochemical parameters compared. Compared to models based on original spectra, the positive effect of the SG filter and derivatives was evident only for predictive models for logS and to a lesser extent for logP. All these observations show that the application of the SG filter and derivatives for new fingerprint representation has great potential and is a good direction for the development of this particular in silico method.

Figure 3. Comparison of the influence of the window width used in the SG filter on the shape of the FTIR fingerprint region. Black indicates a slice of the original spectrum, blue with a window width of 100, red with a window width of 200, and green with a window width of 400

Figure 5. The effect of window widths of 500 and 1500 on the smoothing power of selected parts of the trimesic acid spectrum

Figure 8. The percentage deviation between the RMSE values for the models obtained using a regression tree and linear regression


Table 1. Minimum RMSE error values for regression tree models predicting logP, logD and logS values. The lowest RMSE error values for the prediction models based on the FEDS method and the original spectra for each physicochemical property are shown for reference [17,18]
Science, Technology and Innovation, 2022, 17 (1-2), 9-29