Formulaic Expression Research (定型表現の研究)

Overview

My research focuses on formulaic expressions, formulaic sequences, lexical bundles, phraseology, and phrase frames used in research articles, scholarly papers, academic prose, and scientific documents. The aim of my work is computer-based academic writing assistance. At COLING 2018, I proposed a framework for effectively retrieving formulaic expressions that a user wants to find but cannot come up with. I also proposed a fully automated method for constructing lists of formulaic expressions labelled with communicative functions, communicative roles, communicative purposes, or rhetorical moves.

Extraction of formulaic expressions from scholarly papers

I have proposed two methods to extract formulaic expressions from scientific papers.

  1. sentcoreNER
    I proposed this method at the Workshop on Scholarly Document Understanding (SDU) in 2021. This method consists of two steps: entity removal and n-gram extraction.
  2. NERdepparseLMI
    I modified a method presented in an arXiv paper and presented the modified version at EACL 2021. This method comprises three steps: entity removal, dependency-structure-based word removal, and LMI-based word removal.
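
The two steps named for sentcoreNER (entity removal, then n-gram extraction) can be illustrated with a toy sketch. This is not the implementation from the papers: a hand-written entity set stands in for a real named-entity recogniser, and the names `SLOT`, `remove_entities`, and `extract_ngrams`, as well as the example sentences, are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical placeholder that marks where an entity was removed.
SLOT = "*"


def remove_entities(tokens, entities):
    """Replace tokens listed in `entities` with SLOT, merging adjacent slots.

    Assumption: `entities` is a plain set of strings standing in for the
    output of a named-entity recogniser.
    """
    out = []
    for tok in tokens:
        if tok in entities:
            # A multi-token entity collapses into a single slot.
            if not out or out[-1] != SLOT:
                out.append(SLOT)
        else:
            out.append(tok)
    return out


def extract_ngrams(token_lists, n, min_count=2):
    """Return the n-grams that occur in at least `min_count` sentences."""
    counts = Counter()
    for tokens in token_lists:
        # Use a set so that each sentence contributes at most once per n-gram.
        seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        counts.update(seen)
    return {" ".join(gram) for gram, c in counts.items() if c >= min_count}


# Invented example sentences, already tokenised by whitespace.
sentences = [
    "in this paper we propose BERT".split(),
    "in this paper we propose Latent Dirichlet Allocation".split(),
    "we evaluate BERT on SQuAD".split(),
]
entities = {"BERT", "Latent", "Dirichlet", "Allocation", "SQuAD"}

cleaned = [remove_entities(s, entities) for s in sentences]
candidates = extract_ngrams(cleaned, n=5)
print(sorted(candidates))
```

Once the entities are replaced by a shared slot, sentences about different topics expose the same content-independent word sequence, so frequency counting can surface it as a candidate formulaic expression.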

Assignment of communicative-function labels to formulaic expressions

I proposed a method to classify sentences into categories of communicative functions in my EACL 2021 paper. The training dataset and the resulting database are available.

Database for my SDU 2021 paper

The database is available on GitHub. It contains a large collection of academic formulaic expressions extracted from corpora covering four disciplines.

Dataset/database for my EACL 2021 paper

The whole dataset/database is available at iwatsuki.coresv.com/Iwatsuki-2021-EACL.7z (340 MB). The archive contains:

  1. CF-labelled sentence dataset (CC-BY-NC-SA 3.0)
    This dataset consists of sentences with communicative function labels. It is intended for training and evaluating supervised machine-learning models.
  2. CF-labelled sentence database (CC-BY-NC-SA 3.0)
    This database consists of sentences to which communicative function labels were assigned by the SciBERT-based classifier.
  3. CF-labelled FE database (CC-BY 4.0)
    This database consists of formulaic expressions extracted from scientific papers on computational linguistics, chemistry, oncology, and psychology.

This dataset is built from scientific papers taken from the ACL Anthology Sentence Corpus and the PMC Open Access Subset (bulk for commercial use). The former is licensed under Creative Commons CC-BY-NC-SA 3.0. From the latter, we used three journals: Molecules, Oncotarget, and Frontiers in Psychology. Molecules adopts CC-BY 4.0, Oncotarget adopts CC-BY 3.0, and Frontiers in Psychology adopts the latest CC-BY licence. Hence, the sentence dataset and the sentence database are distributed under the strictest of these licences, CC-BY-NC-SA 3.0. The FE database does not contain any sentences, and thus it is licensed under CC-BY 4.0.

Please cite my paper when you use this dataset.
APA style:
Iwatsuki, K., & Aizawa, A. (2021). Communicative-Function-Based Sentence Classification for Construction of an Academic Formulaic Expression Database. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, xxx–xxx.
BibTeX:
@inproceedings{Iwatsuki2021b, author="Kenichi Iwatsuki and Akiko Aizawa", title="Communicative-Function-Based Sentence Classification for Construction of an Academic Formulaic Expression Database", booktitle="Proceedings of the 16th Conference of the {E}uropean Chapter of the Association for Computational Linguistics", year=2021, pages="xxx--xxx", publisher="Association for Computational Linguistics"}