Abstract
In handwriting recognition research, a public image dataset is necessary to evaluate algorithm correctness and runtime performance. Unfortunately, existing Thai-script image datasets lack variety in standard handwriting types. This paper presents a new offline Thai handwriting image dataset named Burapha-TH. The dataset has 68 character classes, 10 digit classes, and 320 syllable classes. To construct the dataset, 1072 native Thai speakers wrote on collection datasheets, which were then digitized with a 300 dpi scanner. De-skewing, box-detection, and segmentation algorithms were applied to the raw scans to extract individual images. The experiment evaluated several deep convolutional models on the proposed dataset. The results show that the VGG-13 model (with batch normalization) achieved accuracy rates of 95.00%, 98.29%, and 96.16% on the character, digit, and syllable classes, respectively. Unlike all other known Thai handwriting datasets, Burapha-TH retains the existing noise, the white background, and all artifacts generated by scanning. This comprehensive, raw, and more realistic dataset will be useful for a variety of future research purposes.
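The abstract mentions a de-skewing step in the extraction pipeline but does not specify the algorithm. A common lightweight approach for handwritten glyphs is moment-based deskewing: estimate the slant from second-order image moments and undo it with a horizontal shear. The sketch below is an illustration of that generic technique under assumed conventions (ink pixels > 0 on a 0-valued background), not the authors' actual implementation.

```python
import numpy as np

def estimate_skew(img):
    """Estimate horizontal slant of the ink via second-order central moments.

    Assumes a 2D array where ink pixels are > 0 (hypothetical convention;
    the paper does not specify its de-skewing method).
    """
    ys, xs = np.nonzero(img)
    w = img[ys, xs].astype(float)
    cy = np.average(ys, weights=w)
    cx = np.average(xs, weights=w)
    mu11 = np.average((xs - cx) * (ys - cy), weights=w)  # covariance of x and y
    mu02 = np.average((ys - cy) ** 2, weights=w)         # variance along y
    return mu11 / mu02  # shear factor: x drift per row of descent

def deskew(img):
    """Undo the estimated slant with an integer per-row horizontal shift."""
    skew = estimate_skew(img)
    h, _ = img.shape
    cy = (h - 1) / 2.0
    out = np.zeros_like(img)
    for y in range(h):
        # shift each row opposite to the estimated drift (np.roll wraps,
        # which is acceptable for glyphs padded with background)
        out[y] = np.roll(img[y], int(round(-skew * (y - cy))))
    return out
```

A subpixel shear (e.g. via an affine warp) would be more faithful than the integer row shift, but the integer version keeps the sketch dependency-free.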