In this paper, a Gaussian mixture model (GMM) based classifier is described that predicts whether precipitation events will happen on a certain day at a certain time from historical meteorological data. The classifier deals with a two-class classification problem in which one class represents precipitation events and the other represents non-precipitation events. The concept of ambiguity is introduced to represent cases where weather conditions between the two classes, such as drizzle, intermittent rain, or overcast skies, are more likely to occur. Six groups of experiments are carried out to evaluate the performance of the classifier under different configurations, based on the observation data released by Shanghai Baoshan weather station. A typical classification performance of about 75% accuracy, 30% precision, and 80% recall is achieved for prediction tasks with a time span of 12 hours.

Predicting precipitation events, as a part of weather prediction, is usually done by numerical weather prediction, which forecasts future weather conditions by solving systems of partial differential equations. Various attempts to apply machine learning to weather prediction have also been made, though usually with methods other than the Gaussian mixture model. The earliest attempts to apply machine learning to precipitation prediction used perceptrons; more recent research is often based on artificial neural networks [

Given an n-dimensional vector x , a Gaussian mixture probability density function can be written as follows,

$$p(x) = \sum_{i=1}^{m} w_i\, p_i(x) \qquad (1)$$

where $m$ is the number of mixture components, and the mixture weights $w_i$ satisfy $\sum_{i=1}^{m} w_i = 1$ and $w_i \ge 0$. Each component density $p_i(x)$, $i = 1, 2, \cdots, m$, is the probability density function of a Gaussian distribution parameterized by an $n \times 1$ mean vector $\mu_i$ and an $n \times n$ covariance matrix $\Sigma_i$. The component densities can be written as follows

$$p_i(x) = \frac{1}{(2\pi)^{n/2} \det(\Sigma_i)^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}. \qquad (2)$$
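As an illustrative sketch (not code from the paper), Eqs. (1) and (2) can be evaluated directly with NumPy; the two-component parameters below are arbitrary examples:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density of Eq. (2)."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

def gmm_pdf(x, weights, mus, sigmas):
    """Mixture density of Eq. (1): weighted sum of component densities."""
    return sum(w * gaussian_pdf(x, mu, s)
               for w, mu, s in zip(weights, mus, sigmas))

# Example: a 2-component mixture in 2 dimensions (illustrative parameters)
weights = [0.6, 0.4]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]
p = gmm_pdf(np.array([1.0, 1.0]), weights, mus, sigmas)
```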

Given the value of $m$, the $3m$ parameters $w_i$, $\mu_i$ and $\Sigma_i$, $i = 1, 2, \cdots, m$, can then be determined. The EM algorithm is used to estimate these parameters. For a classifier with $K$ classes, a GMM is trained for each class. These models are denoted by $\lambda_k$, $k = 1, 2, \cdots, K$; $\lambda_k$ is also used to denote the model's parameters, that is, $\lambda_k = \{w_i^k, \mu_i^k, \Sigma_i^k\}$, $i = 1, 2, \cdots, m$.

In this paper, the parameters of the GMMs are estimated using the Expectation-Maximization (EM) algorithm, an algorithm for finding maximum likelihood estimates of unknown parameters. For a data set with $g$ feature vectors $\{x_1, \cdots, x_g\}$, the likelihood function of the GMM can be written as follows

$$L = \prod_{j=1}^{g} \sum_{i=1}^{m} w_i\, p_i(x_j) \qquad (3)$$

A detailed description of the EM algorithm can be found in [

Since EM typically converges to a local optimum and involves random initialization, the estimated parameters may sometimes result in poor model performance. To solve this problem, a workaround is proposed as described later.
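To make the procedure concrete, here is a minimal, illustrative EM loop for a GMM with diagonal covariances, written in NumPy. It is a sketch of the standard algorithm, not the paper's implementation; the function name and defaults are our own:

```python
import numpy as np

def em_fit_diag_gmm(X, m, iters=50, seed=0):
    """Minimal EM for a diagonal-covariance GMM (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    g, n = X.shape
    w = np.full(m, 1.0 / m)                      # mixture weights
    mu = X[rng.choice(g, m, replace=False)]      # init means at random data points
    var = np.tile(X.var(axis=0), (m, 1)) + 1e-6  # diagonal variances, one row per component
    for _ in range(iters):
        # E-step: responsibilities r[j, i] = P(component i | x_j)
        log_p = (-0.5 * (np.log(2 * np.pi * var).sum(axis=1)
                         + (((X[:, None, :] - mu) ** 2) / var).sum(axis=2))
                 + np.log(w))
        log_p -= log_p.max(axis=1, keepdims=True)  # stabilize before exponentiating
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from responsibilities
        Nk = r.sum(axis=0)
        w = Nk / g
        mu = (r.T @ X) / Nk[:, None]
        var = (r.T @ (X ** 2)) / Nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```

Restarting this loop from several random seeds and keeping the parameters with the highest likelihood is a common remedy for EM's sensitivity to initialization, in the spirit of the workaround mentioned above.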

For a classifier with $K$ classes $\lambda_k$, $k = 1, 2, \cdots, K$, a feature vector $x$ is assigned to the class with the greatest posterior probability. That is, $x$ is assigned to class $\lambda_j$ if

$$p(\lambda_j \mid x) \ge p(\lambda_k \mid x), \quad k = 1, 2, \cdots, K \qquad (4)$$

Using Bayes’ theorem, this can also be written as

$$\frac{p(x \mid \lambda_j)}{p(x \mid \lambda_k)} \ge \frac{p(\lambda_k)}{p(\lambda_j)}, \quad k = 1, 2, \cdots, K \qquad (5)$$

where $p(\lambda_k)$ stands for the prior probability of class $\lambda_k$.

$\lambda_1$ is used to denote the class of precipitation events and $\lambda_2$ the class of non-precipitation events, so the precipitation events classifier deals with a two-class classification problem. In this paper, we let $p(\lambda_1) = p(\lambda_2)$; thus a vector $x$ is assigned to the class with the greater Gaussian mixture density value. That is, the classifier reports precipitation events if

$$p(x \mid \lambda_1) > p(x \mid \lambda_2) \qquad (6)$$

and reports non-precipitation events otherwise. In practice, these values are computed and compared in log form, so the above inequality is evaluated as follows

$$\log p(x \mid \lambda_1) > \log p(x \mid \lambda_2). \qquad (7)$$

For the precipitation events prediction problem, feature vectors of different classes can lie very close to each other in terms of distance. In such cases, the prediction results are often inaccurate. For this reason, a prediction result is flagged as ambiguous if

$$\left| \log p(x \mid \lambda_1) - \log p(x \mid \lambda_2) \right| < \log 2 \qquad (8)$$

which is the same as

$$\frac{\max\{p(x \mid \lambda_1),\, p(x \mid \lambda_2)\}}{\min\{p(x \mid \lambda_1),\, p(x \mid \lambda_2)\}} < 2. \qquad (9)$$
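The decision rule of Eqs. (6)-(8) amounts to a few lines of code. This sketch (function and label names are our own) takes the two per-class log-densities and returns a class label or an ambiguity flag:

```python
import math

LOG_AMBIG = math.log(2.0)  # ambiguity threshold from Eq. (8)

def classify(logp1, logp2):
    """Compare log p(x | lambda_1) and log p(x | lambda_2); flag the
    result as ambiguous when they differ by less than log 2."""
    if abs(logp1 - logp2) < LOG_AMBIG:
        return "ambiguous"
    return "precipitation" if logp1 > logp2 else "non-precipitation"
```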

When a prediction is flagged as ambiguous, the result is considered close to the cases where weather conditions lie between the two classes, such as drizzle, intermittent rain, or overcast skies. However, this claim is not tested, since doing so would turn the task into a multiclass classification problem. When evaluating classification performance, data points flagged as ambiguous are excluded from the evaluation process. From our experimental results, we found that in most cases about 10% of all data are flagged as ambiguous.

In this paper, the meteorological data of Shanghai, China is used for the experiments. The data are obtained from Shanghai Baoshan weather station, station id 58362 (historical data obtained from http://www.meteomanz.com/). The station issues observation data 8 times each day, at a fixed interval of 3 hours.

We have chosen temperature, relative humidity, sea level pressure, wind direction, wind speed, total cloud cover and precipitation as features. Thus, a set of 7 × 1 feature vectors is obtained after feature extraction. Some fields of the observation data are omitted to avoid having to cope with too many missing data. Specifically, when wind speed is equal to 0, we let wind direction be 0. When converting the original data to feature vectors, a normalization step is applied so that every component of the feature vectors is bounded below by 0 and above by 100. This is done by simple linear transformations. All the features used by our model are listed in
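As a small illustration (not the paper's code), the linear rescaling of each feature onto [0, 100] can be written as:

```python
def normalize(value, lo, hi):
    """Linearly map a raw feature value from [lo, hi] onto [0, 100]."""
    return 100.0 * (value - lo) / (hi - lo)

# Example: sea level pressure of 1013 hPa with the range [950, 1050]
pressure_feature = normalize(1013.0, 950.0, 1050.0)  # -> 63.0
```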

Since the observation data are given in the SYNOP format (FM-12), all possible weather conditions in the observation data are known (see http://weather.unisys.com/wxp/Appendices/Formats/SYNOP.html for details). These weather conditions are divided into the two classes, and the corresponding feature vectors are classified accordingly for training. Specifically, fog, mist, haze and overcast are considered non-precipitation events; intermittent rain, drizzle and snow are considered precipitation events.

Even though features that contain too many missing values are omitted from the observation data, there are still cases where data are absent due to the difficulty of observation, etc. In such cases, these data rows are simply removed, since removing them has no effect on training or testing the classifier. This step can cause a data loss of about 60%.
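Dropping rows that contain any missing field can be sketched in plain Python; the field names below are hypothetical stand-ins for the station's observation fields:

```python
# Hypothetical observation rows; None marks a missing field
rows = [
    {"temp": 12.0, "rh": 80.0, "wind": 3.0},
    {"temp": None, "rh": 75.0, "wind": 2.0},  # missing temperature -> dropped
    {"temp": 9.5, "rh": 90.0, "wind": None},  # missing wind speed -> dropped
]

# Keep only rows where every field is present
clean = [r for r in rows if all(v is not None for v in r.values())]
```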

| Feature | Unit | Value range |
|---|---|---|
| Temperature | ˚C | [−30, 50] |
| Relative humidity | % | [0, 100] |
| Sea level pressure | hPa | [950, 1050] |
| Wind direction | ˚ | [0, 360] |
| Wind speed | km/h | [0, 50] |
| Total cloud cover | N/A | [0, 1] |
| Precipitation (averaged by hour) | mm | [0, 10] |

When training GMMs, diagonal covariance matrices are used instead of full covariance matrices. This is done because it has been found that doing so not only makes GMMs perform better in practice but also significantly reduces the computation required, since matrix inversion is computationally intensive [

For our classification model, we refer to the class of precipitation events as the positive class and the class of non-precipitation events as the negative class. The number of actual positive data points classified as positive is denoted by true positives (TP), and the number classified as negative by false negatives (FN); true negatives (TN) and false positives (FP) are defined similarly. To evaluate the performance of the classifier, a definition of classification accuracy is introduced. Instead of defining classification accuracy as the ratio of correctly classified samples to all samples in the data set, we define it as follows

$$\text{accuracy} = 1 - \frac{1}{2} \left( \frac{\text{FN}}{\text{TP} + \text{FN}} + \frac{\text{FP}}{\text{TN} + \text{FP}} \right) \qquad (10)$$

Classification accuracy is defined this way because precipitation events happen less often than non-precipitation events; precipitation events typically make up only about 10% of all data, which would cause FN and TP to have little effect on a conventional accuracy measure. Precision and recall are also used as key factors to evaluate classification performance, defined as follows

$$\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \qquad (11)$$

$$\text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \qquad (12)$$
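Eqs. (10)-(12) can be computed directly from confusion-matrix counts; the counts below are made-up numbers chosen to mimic the imbalanced setting described above:

```python
def metrics(tp, fn, fp, tn):
    """Balanced accuracy (Eq. 10), precision (Eq. 11), recall (Eq. 12)."""
    accuracy = 1.0 - 0.5 * (fn / (tp + fn) + fp / (tn + fp))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Illustrative counts for an imbalanced two-class problem
acc, prec, rec = metrics(tp=80, fn=20, fp=160, tn=740)
```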

Since EM typically converges to a local optimum and involves random initialization, a single test is not enough to assess model performance. Thus, the classification accuracy, precision and recall are averaged over 10 trials and the averages are used as metrics for performance evaluation.

In this section, we describe the experimental results obtained from six groups of experiments carried out to evaluate the performance of the model under different configurations. To illustrate the effect of the amount of data, two data sets are used: one contains 3 years of historical data and the other 11 years. Years 2015 and 2016 are chosen as the source of the 3-year data set and years 2006 to 2016 as the source of the 11-year data set. A subset of the whole data set is chosen as the training set and the remainder as the test set, with a ratio of about 2:1 between training and test data points. The 11-year data set is used for all experiments apart from experiment 2, where model performance on the two data sets is compared.

Similarly, GMMs with different numbers of mixture components are used, namely 16, 32, 64 and 128; 64 components are used except in experiment 1, where the effect of the number of components is assessed. A time span of 12 hours is used for predictions except in experiment 5.

In the first experiment, we compared GMMs with different numbers of mixture components and found that these models perform similarly regardless of the number of components, from the results shown in

We can tell that the GMM generalizes well to the observation data from the fact that there is little performance loss on test data compared with training data.

In the second experiment, the 3-year data set is used to train the GMMs and test their performance. A significant decrease in both accuracy and recall is observed in the experimental results shown in

| | Training accuracy | Training precision | Training recall | Test accuracy | Test precision | Test recall |
|---|---|---|---|---|---|---|
| 11-year | 77.11% | 32.22% | 80.36% | 73.95% | 33.10% | 73.08% |
| 3-year | 80.40% | 47.77% | 85.39% | 65.86% | 32.85% | 56.16% |

This could be a clue that 2 years of training data are not enough as opposed to 7 years, and that more training data points can lead to better classification performance. Additionally, slightly higher training performance but lower test performance is observed for the 3-year data set, which means the GMM slightly overfits the training data. Though not strictly tested over 10 trials, a test run using the 3-year data set and 128 mixture components showed a training accuracy of over 90% but a test accuracy of about 65%, which is a clear sign of overfitting.

We tried training GMMs on separate daytime and nighttime data and on separate season data in experiments 3 and 4 respectively, as shown in

In experiment 5, we measured how quickly the predictive power of the model decreases as the prediction time span increases. The results are illustrated in

In the last experiment, we tried adding more information to the original feature vector by appending the feature vector of the observation data 12 hours before the prediction time. Doing so forms a new 14 × 1 feature vector that is effectively two 7 × 1 feature vectors combined. The test results indicate a slightly negative effect on classification performance compared with the first experiment, which is shown in

| | Training accuracy | Training precision | Training recall | Test accuracy | Test precision | Test recall |
|---|---|---|---|---|---|---|
| Full | 77.11% | 32.22% | 80.36% | 73.95% | 33.10% | 73.08% |
| Day | 79.35% | 35.86% | 83.40% | 73.72% | 35.18% | 72.33% |
| Night | 79.34% | 32.47% | 83.35% | 72.28% | 29.97% | 67.13% |
| Spring | 79.14% | 39.40% | 87.74% | 72.78% | 35.68% | 65.29% |
| Summer | 78.92% | 33.50% | 84.97% | 68.80% | 36.73% | 65.78% |

| | Training accuracy | Training precision | Training recall | Test accuracy | Test precision | Test recall |
|---|---|---|---|---|---|---|
| 7-dimensional | 77.11% | 32.22% | 80.36% | 73.95% | 33.10% | 73.08% |
| 14-dimensional | 74.27% | 29.54% | 75.94% | 68.30% | 25.44% | 63.67% |

In this paper, the same prior probabilities are chosen for both classes, and log 2 is used as the threshold for determining ambiguity. This is done because, for one thing, we want to ensure the availability of enough training data, since the model generalizes poorly with insufficient training data. For another, the purpose of this paper is not to determine the best performance the classifier can achieve but to find out how a GMM based classifier behaves on the precipitation events prediction task. Determining the optimal values of the above parameters would require introducing cross-validation sets, which would leave less data for the training and test sets.

From the above experimental results, the conclusion can be drawn that the GMM is an effective model for predicting precipitation events. The classifier built in this paper is one with high recall and low precision; such a classifier may be desirable when failure to prepare for precipitation events would bring about serious consequences. By altering the prior probabilities, a classifier with higher precision and lower recall can be obtained, since lowering the prior probability of precipitation events makes the classifier report precipitation events only when it is very confident about the result. In this sense, the classifier may also be useful in cases where false alarms should be avoided. From our experiments, we estimate that one of the most important factors limiting classification performance may be the availability of features. Cases are observed where an unambiguous feature vector belonging to $\lambda_1$ is classified to $\lambda_2$, and vice versa. This may mean that the 7-dimensional feature vector does not contain enough information to almost uniquely determine future weather conditions, even for a time span of 12 hours. This claim can also be verified from the training accuracy shown in

Ling, H.T. and Zhu, K.P. (2017) Predicting Precipitation Events Using Gaussian Mixture Model. Journal of Data Analysis and Information Processing, 5, 131-139. https://doi.org/10.4236/jdaip.2017.54010