COMPARISON OF SPELL CORRECTION IN BAHASA INDONESIA: PETER NORVIG, LSTM, AND N-GRAM

This study compares three spell-checking methods for Bahasa Indonesia: Peter Norvig's method, Long Short-Term Memory (LSTM), and N-gram. The primary evaluation metric is accuracy in correcting spelling errors. The evaluation uses SPECIL data (Spell Error Corpus for Indonesian Language), which includes documents with four error types: insertion, deletion, transposition, and substitution. The testing dataset consists of 150 words, aligning with the 150-word corpus references from the 'Leipzig Corpora Collection' used for Peter Norvig's and N-gram methods. Peter Norvig's method achieves the highest accuracy at 89%, the N-gram method follows with 75%, and LSTM trails at 74%. The LSTM method uses a reference dataset from SPECIL comprising 150 data points and focuses specifically on insertion errors for the test data. These results offer researchers, developers, and language technology enthusiasts insights for refining spell-checking techniques for Bahasa Indonesia. By covering diverse error types and using a standardized testing dataset, the study aims to contribute to the continual improvement of spell-checking tools.


INTRODUCTION
In 2023, text comprised 82% of all internet data, as reported by the Cisco Visual Networking Index. This underscores the dominance of text as a data type on the internet. Textual data can be found in various forms, such as news articles, blogs, social media posts, and official documents. To illustrate the magnitude of textual data on the internet, as of January 2022, Google processed more than 5.6 billion searches per day, each search encompassing text in various languages and topics [1]. The openness of information on the internet has made the compilation and management of textual data an increasingly complex task. This data varies in quality, accuracy, and relevance, necessitating careful processing. In an academic context, precise and typo-free writing is of paramount importance [2]. Research, assignments, papers, and scientific reports require the ability to compose clear, cohesive, and accurate text. Writing errors can diminish credibility, disrupt comprehension and interpretation of the text, and weaken the impact of the writing. This can lead to confusion, affect the validity of the writing, and undermine the impression of professionalism. Hence, error-free writing is essential.
Typographical errors in documents are produced by a variety of factors, including unintentional errors, mechanical faults, hand or finger slips, and the proximity of letters on the keyboard [3]. A spelling correction (or spelling suggestion) system can help detect such errors and suggest the correct words. This system's function is to detect errors and provide alternative word recommendations [4]. Spelling correction includes two types of checking: real-word spell checking and non-word error spell checking. While real-word spell checking focuses on valid words that are wrong in the context of the phrase, non-word error spell checking deals with misspelled words caused by typographical errors [2].
Kusuma, A T A, and Ratnasari C I, Comparison of Spell … 215
Several spell-checking studies focus on typographical errors as the source of word errors [5]. Among earlier research, the study conducted by [6] determined the common type of spelling error by utilizing Levenshtein distance and N-gram. This study used a dataset of 4,453 English misspellings gathered by Wikipedia contributors. The dataset gives the correct word for each misspelled word token and addresses typographical issues in Wikipedia articles. In the evaluation stage, recall was calculated against the correct words. The findings reveal that the Levenshtein distance has a greater recall value than the N-gram, with 79% and 65%, respectively. Another similar study was conducted by [7], examining spelling correction using Peter Norvig and N-gram. According to its findings, the Peter Norvig approach cannot correct certain spelling problems, such as words that contain two misspellings. A few sentences also include personal names, and the terms containing those names are treated as spelling mistakes because they are not included in the KBBI (Kamus Besar Bahasa Indonesia) dictionary word list. Using 55 texts as test material, the spelling correction accuracy is 69.09%. Another study was undertaken by [8], which used the LSTM (Long Short-Term Memory) approach to perform a spell check. The small data set used for training and testing contains 100,000 words, 12,961 of them unique. For the large data set, 80% of the total is used for training and 20% for testing. The reasoning behind evaluating both small and large data sets is that some applications, such as query correction, require only terms from dictionaries with a limited vocabulary. The LSTM approach has 73.77% accuracy and a processing time of 0.328 seconds per word.
In this study, we compare three spelling correction methods: Peter Norvig, N-gram, and LSTM. These approaches were chosen because they are flexible and have parallels with other methods. All three techniques are capable of handling large datasets, and the capacity to maximize spelling correction accuracy improves with dataset size. In presenting the findings, we show how our suggested method works more accurately than previous techniques, particularly when managing intricate linguistic structures that are exclusive to the Indonesian language. Beyond the particular techniques employed, our study adds value by shedding light on the difficulties associated with correcting spelling in Indonesian and laying the groundwork for further studies in the area of natural language processing. Other scholars working in this field can use our results to improve their approaches to spell checking and correction for the Indonesian language, which will ultimately help to advance NLP applications in the area.

RESEARCH METHOD
In this study, a model is built for checking and correcting spelling in Indonesian using the Peter Norvig, N-gram, and LSTM approaches. The study intends to improve spelling correction accuracy through a systematic series of processes. Figure 1 depicts the stages of this study. The first stages involve developing a corpus and collecting test data, followed by data preprocessing techniques such as tokenization and case folding. After that, the spelling correction methods (Peter Norvig, N-gram, and LSTM) are built and applied to the test data. Each method is based on its own set of principles. Performance is evaluated by comparing the accuracy of each method, and the one yielding the best results is identified. This research focuses on the development and evaluation of a spelling correction model with the goal of reaching the highest level of correctness.

Corpus Development
The corpus dataset used for the spelling correction comparison is obtained from the "Wortschatz Leipzig" website, with reference data of 10,000 words in the format of a text file (.txt). This website offers services in a growing number of languages under the name Leipzig Corpora Collection. The site provides the most extensive publicly available text resources in many languages. The dataset was selected for the completeness and comprehensiveness of its word sources in the Indonesian language. The corpus is used to validate sentences or words that are correct according to the KBBI.

Collecting Test Data
The data for this research was gathered and examined to gain a more thorough grasp of the data being processed [9]. The test data used to compare the Peter Norvig, N-gram, and LSTM approaches was gathered from the SPECIL (Spell Error Corpus for Indonesian Language) dataset on the Kaggle website. This corpus can be used by practitioners and academics to identify and correct spelling mistakes in the Indonesian language. The data has a total of 21,500 entries.

Table 1. List of correct words and misspelled words in the corpus
In this research, as shown above in Table 1, we utilize four data categories: insertion, deletion, transposition, and substitution. Each category signifies a specific type of error encountered in the text. The categories are explained as follows:

a. Deletion
The algorithm removes characters from incorrect words. It eliminates one character from the word and verifies that the resulting term is correct before adding it to the list of suggestions. The procedure is repeated for every letter in the word.

b. Insertion
This method fixes typos in words that have a missing character. The basic idea is to put a letter from the alphabet in the spot where the error happened and then verify that the resulting word is correct before adding it to the list of potential words.

c. Substitution
The substitution algorithm takes a word and substitutes one letter for another letter of the alphabet, then tests whether the resulting word is valid before adding it to the list of suggestions.

d. Transposition
The algorithm moves a single letter in a word by inserting it in every other location. Each time, it verifies that the newly formed word is correct before adding it to the recommendation list. The procedure is repeated for each letter in the word [10].
The formula for calculating the edit distance between two strings uses the following notation: D[i][j] refers to a cell of the edit distance algorithm's dynamic matrix, i represents the row of the matrix, and j represents the column of the matrix. The symbol δ(a, b) represents the delta function, which measures the similarity or difference between two characters [3].
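The edit distance computation described above can be sketched in Python. This is a minimal illustration that assumes the standard Levenshtein recurrence, in which each cell is the minimum of a deletion, an insertion, and a substitution whose cost is given by the delta function. The names `edit_distance`, `D`, and `delta` are illustrative and are not taken from the study's implementation.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance between two strings via dynamic programming."""
    rows, cols = len(source) + 1, len(target) + 1
    # D[i][j] holds the distance between source[:i] and target[:j].
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i  # delete all i characters of the source prefix
    for j in range(cols):
        D[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            # Assumed delta function: 0 when the characters match, 1 otherwise.
            delta = 0 if source[i - 1] == target[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,          # deletion
                D[i][j - 1] + 1,          # insertion
                D[i - 1][j - 1] + delta,  # substitution or match
            )
    return D[rows - 1][cols - 1]


print(edit_distance("makn", "makan"))
```

Under this plain Levenshtein variant a transposition such as "mkaan" for "makan" costs two operations, because swapping adjacent letters is not treated as a single edit.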

Pre-processing
Before text data is processed, several preprocessing steps are applied to obtain keywords [11]. Case folding and tokenization are two of the preprocessing stages [12]. Case folding is the process of transforming a word's characters into a single uniform form. Converting a mix of capital and lowercase letters to a uniform form first makes it easier to correct misspelled writing. As shown in Table 2, keeping the letter forms consistent typically entails changing all of the characters to lowercase [13]. Writing faults sometimes produce text in which capital letters or similar characters are used inconsistently [14].
Tokenization is the process of dividing a sentence, paragraph, or text into individual words or smaller units. Especially for agglutinative languages, it is an essential step in building a highly accurate spelling error detection model [15].

Tokenization Example
Tokenizer built around words Indonesia: ["i", "n", "d", "o", "n", "e", "s", "i", "a"]
In Table 3, the input text is broken down into segments, or tokens. The procedure preserves the sequence of the tokenized text while removing certain characters, such as punctuation. The results of this tokenization procedure are individual words [15].
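The two preprocessing stages above can be illustrated with a short Python sketch. It assumes lowercase conversion for case folding, whitespace splitting after punctuation removal for word tokens, and a per-letter split for character tokens. The function names `case_fold`, `word_tokenize`, and `char_tokenize` are hypothetical and are not the study's code.

```python
import string


def case_fold(text: str) -> str:
    """Convert all characters to lowercase so letter forms are uniform."""
    return text.lower()


def word_tokenize(text: str) -> list[str]:
    """Strip ASCII punctuation, then split on whitespace, keeping word order."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def char_tokenize(word: str) -> list[str]:
    """Split a single word into its characters, as a character-level model needs."""
    return list(word)


print(word_tokenize(case_fold("Apa Kabar, Dunia?")))
print(char_tokenize(case_fold("Indonesia")))
```

The character-level split reproduces the `Indonesia` example given in the tokenization table, while the word-level split is the form the dictionary-based methods would consume.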

Spelling Correction
Spelling checkers are computer-based programs designed to identify and correct word mistakes. The spelling checker searches the manuscript for all kinds of errors, flags them, notifies the writer of the faults, and provides suggestions for fixing them [16]. Spelling correction is one tool for fixing spelling mistakes. Errors may occur because a word has too many or too few characters, or because certain characters are incorrect [17].
Typographical errors result in a string with more, fewer, or different characters than the corresponding vocabulary entry. Spelling error detection and correction typically involves three primary steps: lexicon preparation, candidate creation, and string correction, depending on the intended term and context [18].

Peter Norvig Method
The Peter Norvig technique uses probability to estimate how likely it is that a misspelled word corresponds to each word in the corpus. The method looks for word candidates that are close to the intended word using candidate models such as splits, deletion, transposition, substitution, and insertion. To find the proper words and match them against the corpus based on likelihood, Peter Norvig's Spelling Corrector searches over a word's character combinations. At the splits step, the misspelled word is divided into left and right parts [19]. The most probable spelling correction c for the word w is chosen using the Spelling Corrector's word corrector feature. Because the choice is probabilistic, no candidate is certain to be the intended word. The formula looks for the correction c, among all potential candidate corrections, that maximizes the probability that c is the intended correction given the original word w.

$$\text{correction}(w) = \underset{c \in \text{candidates}}{\operatorname{argmax}}\; P(c \mid w) \tag{2}$$
Based on Bayes' theorem, this equals:
$$\underset{c \in \text{candidates}}{\operatorname{argmax}}\; \frac{P(c)\,P(w \mid c)}{P(w)} \tag{3}$$
Since P(w) is the same for every potential candidate c, we may drop it and write:
$$\underset{c \in \text{candidates}}{\operatorname{argmax}}\; P(c)\,P(w \mid c) \tag{4}$$
Based on the above equation, there are four parts: (1) argmax selects the candidate with the highest combined probability; (2) c ∈ candidates indicates that the word c is drawn from the candidate set; (3) P(c) is the probability that candidate c appears in a corpus of documents; (4) P(w|c) is the probability that the word w was typed when the intended text was candidate c.
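The candidate generation and selection steps described above can be sketched as follows. This is a simplified illustration in the spirit of Peter Norvig's published corrector, not the study's implementation: it considers only candidates one edit away, treats all such candidates as equally likely under P(w|c), and ranks them by P(c) estimated from word counts. The toy `CORPUS` is a hypothetical stand-in for the Leipzig reference corpus.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical toy corpus; the study uses a Leipzig Corpora Collection word list.
CORPUS = "saya makan nasi dan saya minum air lalu saya makan roti".split()
WORDS = Counter(CORPUS)
TOTAL = sum(WORDS.values())


def probability(word: str) -> float:
    """P(c): relative frequency of the word in the corpus (0 if unseen)."""
    return WORDS[word] / TOTAL


def edits1(word: str) -> set[str]:
    """All strings that are one edit away from the given word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:] for left, right in splits if right
                for ch in LETTERS]
    inserts = [left + ch + right for left, right in splits for ch in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words) -> set[str]:
    """Keep only the words that occur in the corpus."""
    return {w for w in words if w in WORDS}


def candidates(word: str) -> set[str]:
    """Prefer the word itself, then known one-edit neighbours, else give up."""
    return known([word]) or known(edits1(word)) or {word}


def correction(word: str) -> str:
    """Pick the candidate c that maximizes P(c)."""
    return max(candidates(word), key=probability)


print(correction("makn"), correction("nais"), correction("sayaa"))
```

When no known candidate exists, `correction` returns the input unchanged, which is one way a word can end up without a suggestion.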

N-gram Method
N-gram is a technique for locating misspelled words in large text volumes. A consecutive sequence of N items, such as words, characters, syllables, or phonemes, is known as an N-gram. For instance, a bigram (2-gram) is a sequence of two words, like "apa kabar," "dunia lain," and "makan besar." N-gram frequencies are recorded in an n-dimensional matrix, which is used to perform the check. The system marks a word as misspelled if it finds an uncommon or nonexistent n-gram; otherwise, it does not [20].
Rather than matching every word in a text to a dictionary, this study examines N-grams. The probability of a long sentence can be calculated by breaking it up into smaller chunks and applying the conditional probability rule to obtain the total probability. Word-gram-level similarity is used to find and correct misspelled words [21]. In the formula for determining the probability of N-grams, X is the word count of a sentence, and N is the quantity of N-grams. The formula for determining the bigram probability is as follows:
$$P(w_n \mid w_{n-1}) = \frac{c(w_{n-1}\, w_n)}{c(w_{n-1})} \tag{6}$$
where P is the probability of the N-gram, w is a word, n is the index, and c is the frequency of words in a bigram.
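The bigram probability described above can be sketched as a count ratio in Python. This minimal example assumes a maximum-likelihood estimate, that is, the frequency of the word pair divided by the frequency of the preceding word, with no smoothing. The `SENTENCES` list is hypothetical illustration data, not the study's corpus.

```python
from collections import Counter

# Hypothetical training sentences for illustration only.
SENTENCES = ["apa kabar dunia", "apa kabar kamu", "apa itu"]

unigram_counts: Counter = Counter()
bigram_counts: Counter = Counter()
for sentence in SENTENCES:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    # Pair each token with its successor to form word bigrams.
    bigram_counts.update(zip(tokens, tokens[1:]))


def bigram_probability(previous: str, word: str) -> float:
    """Estimate P(word | previous) as c(previous word) / c(previous)."""
    if unigram_counts[previous] == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return bigram_counts[(previous, word)] / unigram_counts[previous]


print(bigram_probability("apa", "kabar"))
print(bigram_probability("kabar", "dunia"))
```

A bigram with zero or very low probability is the kind of uncommon sequence that such a checker would flag; a production system would usually add smoothing so that unseen pairs do not collapse to exactly zero.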

LSTM Method
In addition to solving the exploding and vanishing gradient issues of the basic RNN design, the LSTM technique has gained favor in recent years due to its overall superior performance over the RNN architecture [22]. When detecting spelling errors, the LSTM architecture's recurrent connections allow the model to assess the preceding character or word components in addition to the subsequent ones. An LSTM-based model with a character-based tokenizer, which breaks words up into letters, was employed and compared with previous seq2seq models [15].
The recurrent neural network (RNN) is a natural extension of feedforward neural networks to sequences. Given a sequence of inputs $(x_1, \ldots, x_T)$, a conventional RNN computes a sequence of outputs $(y_1, \ldots, y_T)$ by iterating the following equations:
$$h_t = \mathrm{sigm}\left(W^{hx} x_t + W^{hh} h_{t-1}\right) \tag{7}$$
$$y_t = W^{yh} h_t$$
Because of the vanishing gradient problem, RNNs have difficulty handling long-term dependencies in the data. Recurrent neural networks with Long Short-Term Memory (LSTM) are used to tackle this issue. The use of an encoder and decoder in LSTM simplifies the problem. The encoder, an LSTM working at the character level, processes the input sequence as a series of vectors. Each vector represents the meaning of the characters in the sequence that have been read up to that point. The decoder, in turn, is a character-level LSTM recurrent network with attention. It takes the final hidden state of the character-based LSTM encoder as its input [23].
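The recurrence of a conventional RNN described above can be sketched in plain Python. This is a minimal illustration of the forward pass only, assuming a zero initial hidden state, a logistic sigmoid nonlinearity, and no bias terms. It is not the LSTM encoder-decoder used in the study, and the weight names `W_hx`, `W_hh`, and `W_yh` are simply labels for the three matrices.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid nonlinearity."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, W_hx, W_hh, W_yh):
    """Iterate h_t = sigm(W_hx x_t + W_hh h_{t-1}) and y_t = W_yh h_t."""
    hidden = [0.0] * len(W_hh)  # assumed initial hidden state h_0 = 0
    outputs = []
    for x_t in inputs:
        pre_activation = [a + b for a, b in
                          zip(matvec(W_hx, x_t), matvec(W_hh, hidden))]
        hidden = [sigmoid(z) for z in pre_activation]
        outputs.append(matvec(W_yh, hidden))
    return outputs


# Two one-hot inputs, one hidden unit, one output unit (toy sizes).
example = rnn_forward(
    inputs=[[1.0, 0.0], [0.0, 1.0]],
    W_hx=[[0.5, -0.5]],
    W_hh=[[0.1]],
    W_yh=[[2.0]],
)
print(example)
```

Because each hidden state depends on the previous one through repeated multiplication by `W_hh`, gradients flowing back through many steps can shrink or grow rapidly, which is the difficulty the LSTM gating mechanism is designed to ease.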

RESULT AND DISCUSSION
The spelling correction model is developed using the Python programming language, version 3.10.2, with Visual Studio Code (VSC) and its Jupyter Notebook extension. We conduct a comparative test of several spell-checking methods using SPECIL data (Spell Error Corpus for Indonesian Language), which includes four documents covering insertion, deletion, transposition, and substitution errors. The testing dataset comprises a total of 150 words. The Peter Norvig and N-gram methods also use 150 words, with the corpus reference from the 'Leipzig Corpora Collection'. In the experiment, Peter Norvig and N-gram calculated the probabilities of the words entered into the system and selected the candidate with the highest probability. The n-gram technique makes use of bigrams, which are made up of two word tokens, and unigrams, which are made up of single word tokens. Norvig and n-gram cannot find every suggestion present in the corpus, and in such cases they are unable to provide suggestions for the misspelled words.
The LSTM method involves a series of steps intended to enhance correction accuracy. Initially, text data undergoes tokenization and pre-processing to ensure readiness for subsequent stages. Following validation and fine-tuning, the model is tested using independent test data to check its performance beyond the training sample. The primary advantage of LSTM lies in its capacity to provide accurate and contextual spelling corrections. Optimizing and successfully implementing the tested model can enhance spelling correction quality across various application contexts and text environments. The three methods, including LSTM, share the common limitation of requiring complete data for effective training. Specifically, LSTM's higher computational efficiency comes with the need for a more substantial and dense dataset. The choice among these methods should balance model complexity, data completeness, and computational resources for optimal performance.
Table 4. Examples of spelling correction results using three models
Table 4 shows a notable distinction between the LSTM (Long Short-Term Memory), N-gram, and Peter Norvig algorithms. Experiments conducted on spelling correction yielded quite satisfactory results. For the implementation of the LSTM method, the data is taken from SPECIL (Spell Error Corpus for Indonesian Language), with a total of 150 data points. Only data labeled as 'insertion' is used. However, the sentences within the .csv file are randomly selected, which has resulted in a lower accuracy for the LSTM compared to the other two methods. Figure 2 illustrates the accuracy achieved in each experiment. Peter Norvig achieved 89% accuracy with a calculation speed of 35 words per second, correcting 387 words, while 5% of words remained unknown. N-gram achieved the second-best result, with 75% accuracy and a calculation speed of 21 seconds per word, correcting 150 words and leaving 11% of words unknown. The LSTM algorithm achieved a model accuracy of 67% using 1000 epochs.
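The accuracy and unknown-word figures reported above can be computed with a simple tally, sketched here in Python. This is an assumed evaluation procedure rather than the study's script: it counts a suggestion as correct when it equals the expected word and as unknown when the corrector returns `None`. The `toy_corrector` and `test_pairs` are hypothetical and do not reproduce any result from the study.

```python
from typing import Callable, Optional


def evaluate(corrector: Callable[[str], Optional[str]],
             test_pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Return the share of correct suggestions and of words left without one."""
    total = len(test_pairs)
    correct = 0
    unknown = 0
    for misspelled, expected in test_pairs:
        suggestion = corrector(misspelled)
        if suggestion is None:
            unknown += 1  # the method produced no suggestion
        elif suggestion == expected:
            correct += 1
    return {"accuracy": correct / total, "unknown_rate": unknown / total}


# Hypothetical lookup-based corrector used only to exercise the function.
LOOKUP = {"makn": "makan", "nais": "nasi", "minm": "minus"}


def toy_corrector(word: str) -> Optional[str]:
    return LOOKUP.get(word)


test_pairs = [("makn", "makan"), ("nais", "nasi"),
              ("minm", "minum"), ("rotii", "roti")]
print(evaluate(toy_corrector, test_pairs))
```

Running the same function over each method with an identical list of test pairs keeps the comparison on equal footing.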

CONCLUSION
In conclusion, the assessment of Indonesian spell-checking methods reveals that Peter Norvig's approach is the most effective, with an accuracy rate of 89%. The N-gram method takes second position with 75% accuracy, while LSTM lags slightly behind at 74%. These findings underscore the significance of exploring diverse techniques in spell checking, with Norvig's method standing out as a frontrunner in enhancing accuracy for the Indonesian language.
Researchers and developers can use these insights to make informed decisions about the advancement of spell-checking technologies tailored to the intricacies of the Indonesian language. The use of a reference dataset of 10,000 words provides a solid foundation for testing. Although Norvig performs the best, both n-gram and LSTM also contribute significantly. The SPECIL test dataset of 21,500 words demonstrates the robustness of the three methods against a larger dataset. This research provides important insights into the suitability and effectiveness of spell-checking methods in the context of the Indonesian language.

Figure 2. The result of the comparative analysis of spelling correction methods

Table 2. Examples of case-folding words