Data Standardization Using Hidden Markov Models
Record linkage refers to the process of joining records that relate to the same entity or event in one or more data collections. In the absence of a shared, unique key, record linkage involves comparing ensembles of partially identifying, non-unique data items between pairs of records. Data items with variable formats, such as names and addresses, need to be transformed and normalized before these comparisons can validly be carried out. Traditionally, deterministic rule-based data-processing systems have performed this pre-processing, commonly referred to as "standardization". This project describes an alternative approach to standardization that combines lexicon-based tokenization with probabilistic Hidden Markov Models (HMMs).

The project is developed in Visual Studio using C# .NET as the programming language. The system has a single user role, the admin, who must first log in with valid credentials to access it. After a successful login, the admin can add training data by filling in all of the registration fields. To analyze data, the admin enters free-form values into the information fields; this input is unstructured and must be converted into a properly structured form. Once the data has been entered, the admin runs the analysis: the algorithm processes the input, standardizes it, and displays the results in the corresponding fields of the analyzed-data view.
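To make the two stages concrete, the sketch below shows them in miniature: a lexicon-based tokenizer that maps each raw token to a coarse observation symbol, and Viterbi decoding over a small HMM whose hidden states are the output fields. This is a minimal illustration under assumed conditions, not the project's actual implementation: the state names, lexicon entries, and all probabilities are hypothetical placeholders, whereas in the system the HMM parameters would be estimated from the training records entered by the admin.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch only: a tiny lexicon-based tokenizer feeding a hand-specified
// HMM whose Viterbi path labels each token with an output field. All state
// names, lexicon entries, and probabilities are hypothetical placeholders;
// in the project they would be learned from the admin's training records.
class HmmStandardizer
{
    // Hidden states: the structured fields a token can be assigned to.
    static readonly string[] States =
        { "GivenName", "Surname", "StreetNumber", "StreetName", "StreetType" };

    // Observation symbols produced by the tokenizer:
    // 0 = GIVEN (known given name), 1 = NUMBER, 2 = STTYPE (street suffix), 3 = WORD (other).
    static readonly HashSet<string> GivenNames =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "john", "mary", "peter" };
    static readonly HashSet<string> StreetTypes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "street", "st", "road", "rd", "avenue", "ave" };

    // Toy model parameters (each row sums to 1).
    static readonly double[] Init = { 0.45, 0.05, 0.45, 0.03, 0.02 };
    static readonly double[,] Trans =
    {
        //          Given  Sur    StNum  StName StType
        /*Given*/ { 0.10,  0.80,  0.05,  0.04,  0.01 },
        /*Sur*/   { 0.05,  0.10,  0.70,  0.10,  0.05 },
        /*StNum*/ { 0.01,  0.01,  0.08,  0.85,  0.05 },
        /*StName*/{ 0.01,  0.01,  0.01,  0.37,  0.60 },
        /*StType*/{ 0.25,  0.25,  0.25,  0.15,  0.10 },
    };
    static readonly double[,] Emit =
    {
        //          GIVEN  NUMBER STTYPE WORD
        /*Given*/ { 0.70,  0.01,  0.01,  0.28 },
        /*Sur*/   { 0.10,  0.01,  0.01,  0.88 },
        /*StNum*/ { 0.01,  0.95,  0.01,  0.03 },
        /*StName*/{ 0.05,  0.05,  0.05,  0.85 },
        /*StType*/{ 0.01,  0.01,  0.90,  0.08 },
    };

    // Lexicon-based tokenization: map a raw token to a coarse symbol index.
    static int Classify(string token) =>
        token.All(char.IsDigit) ? 1
        : StreetTypes.Contains(token) ? 2
        : GivenNames.Contains(token) ? 0
        : 3;

    // Standard Viterbi decoding in log space: the most probable hidden-state
    // sequence assigns each token its field label.
    static string[] Viterbi(int[] obs)
    {
        int n = obs.Length, s = States.Length;
        var logP = new double[n, s];
        var back = new int[n, s];
        for (int j = 0; j < s; j++)
            logP[0, j] = Math.Log(Init[j]) + Math.Log(Emit[j, obs[0]]);
        for (int t = 1; t < n; t++)
            for (int j = 0; j < s; j++)
            {
                double best = double.NegativeInfinity;
                for (int i = 0; i < s; i++)
                {
                    double p = logP[t - 1, i] + Math.Log(Trans[i, j]);
                    if (p > best) { best = p; back[t, j] = i; }
                }
                logP[t, j] = best + Math.Log(Emit[j, obs[t]]);
            }
        var path = new int[n];                         // trace back the best path
        for (int j = 1; j < s; j++)
            if (logP[n - 1, j] > logP[n - 1, path[n - 1]]) path[n - 1] = j;
        for (int t = n - 1; t > 0; t--) path[t - 1] = back[t, path[t]];
        return path.Select(i => States[i]).ToArray();
    }

    static void Main()
    {
        var tokens = "John Smith 42 Baker Street"
            .Split(new[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries);
        var labels = Viterbi(tokens.Select(Classify).ToArray());
        for (int t = 0; t < tokens.Length; t++)
            Console.WriteLine($"{tokens[t],-10} -> {labels[t]}");
    }
}
```

With these toy parameters, the example labels "John Smith 42 Baker Street" as GivenName, Surname, StreetNumber, StreetName, StreetType. In practice the lexicons would be far larger and the probabilities derived from frequency counts over the labeled training data, but the decoding step stays the same.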