Measuring academic performance of students in Higher Education using data mining techniques
Thesis posted on 24.08.2018, 16:02 by Mohammed Alsuwaiket
Educational Data Mining (EDM) is a developing discipline concerned with extending classical Data Mining (DM) methods and developing new methods for exploring the data that originate from educational systems. It aims to use those methods to achieve a logical understanding of students and of the educational environment they should have for better learning. These data are characterized by their large size and randomness, which can make it difficult for educators to extract knowledge from them. Additionally, knowledge extracted from data by counting the occurrence of certain events is not always reliable, since the counting process sometimes ignores other factors and parameters that could affect the extracted knowledge. Student attendance in Higher Education has always been dealt with in a classical way: educators count occurrences of attendance or absence and build their knowledge about students and modules on this count. This method is neither credible nor necessarily a real indication of a student's performance. On the other hand, the choice of an effective student assessment method is an issue of interest in Higher Education. Various studies (Romero, et al., 2010) have shown that students tend to get higher marks when assessed through coursework-based assessment methods, which include modules that are either fully assessed through coursework or assessed through a mixture of coursework and examinations, than when assessed by examination alone. A large number of EDM studies pre-process data through the conventional Data Mining processes, including the data preparation process, but they use transcript data as it stands, without looking at the weighting of examination and coursework results, which could affect prediction accuracy.
This thesis explores the above problems and tries to formulate the extracted knowledge in a way that ensures accurate and credible results. Student attendance data, gathered from the educational system, were first cleaned to remove randomness and noise; various attributes were then studied to highlight the most significant ones that affect the real attendance of students. The next step was to derive an equation that measures the Student Attendance's Credibility (SAC) from the attributes chosen in the previous step. The reliability of the newly developed measure was then evaluated in order to examine its consistency. In terms of transcript data, this thesis proposes a different data preparation process, investigating more than 230,000 student records in order to prepare students' marks based on the assessment methods of the enrolled modules. The data were processed through different stages in order to extract a categorical factor through which students' module marks are refined during the data preparation process. The results of this work show that students' final marks should not be isolated from the nature of the enrolled modules' assessment methods; rather, they must be investigated thoroughly and considered during EDM's data pre-processing phases. More generally, it is concluded that educational data should not be prepared in the same way as existing data, owing to differences such as the sources of data, applications, and the types of errors in them. Therefore, an attribute, Coursework Assessment Ratio (CAR), is proposed in order to take the different modules' assessment methods into account while preparing student transcript data. The effect of CAR and SAC on the prediction process using data mining classification techniques such as Random Forest, Artificial Neural Networks and k-Nearest Neighbors has been investigated.
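As an illustration only, the idea of a coursework ratio attached to each module can be sketched as below. The abstract does not give the definition of CAR, so this sketch assumes it is the share of a module's assessment weight that comes from coursework; the function names, the weight inputs and the averaging over a student's modules are all hypothetical and are not taken from the thesis.

```python
def coursework_assessment_ratio(coursework_weight, exam_weight):
    """Assumed definition: coursework share of a module's total assessment weight.

    Returns 1.0 for a module assessed fully by coursework and 0.0 for a
    module assessed by examination alone.
    """
    total = coursework_weight + exam_weight
    if total <= 0:
        raise ValueError("assessment weights must sum to a positive value")
    return coursework_weight / total


def average_car(modules):
    """Mean of the assumed ratio over a student's enrolled modules.

    `modules` is a list of (coursework_weight, exam_weight) pairs; this
    layout is a hypothetical choice made for the sketch.
    """
    if not modules:
        raise ValueError("at least one module is required")
    ratios = [coursework_assessment_ratio(c, e) for c, e in modules]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Hypothetical student enrolled in three modules (weights in percent).
    enrolled = [(100, 0), (30, 70), (0, 100)]
    print(average_car(enrolled))
```

A ratio of this kind could be attached to each transcript record as an extra attribute before classification, which is the role the abstract describes for CAR.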
The results were generated by applying the DM techniques to our data set and evaluated by measuring the statistical differences between the Classification Accuracy (CA) and Root Mean Square Error (RMSE) of all models. A comprehensive evaluation has been carried out for all results in the experiments to compare the results of all DM techniques, and it has been found that Random Forest (RF) has the highest CA and lowest RMSE. The importance of SAC and CAR in increasing the prediction accuracy has been shown in Chapter 5. Finally, the results have been compared with previous studies that predicted students' final marks based on students' marks at earlier stages of their study. The comparisons have taken into consideration similar data and attributes, first excluding average CAR and SAC and then including them, and measuring the prediction accuracy in both cases. The aim of this comparison is to ensure that the new preparation process stage will positively affect the final results.
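The two evaluation measures named above have standard definitions, which the following minimal sketch spells out. The abstract does not say how RMSE was computed for classifier output, so the sketch assumes predictions and targets are already numeric (for example, marks or numerically encoded grade bands); the function names and the sample values are hypothetical.

```python
import math


def classification_accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between numeric values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical grade bands for four students, as labels and as numbers.
    true_bands = ["A", "B", "C", "B"]
    pred_bands = ["A", "B", "B", "B"]
    print(classification_accuracy(true_bands, pred_bands))
    print(rmse([2, 4], [4, 2]))
```

Comparing models then amounts to computing both values for each classifier on the same held-out records, with higher CA and lower RMSE indicating the better model.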
Saudi Arabia, Government.
- Computer Science