Linguistics 101: An introduction to linguistic analysis

People often ask me, “What is it that you do for work?” and depending on how well I know the person, my answer can vary quite a bit. I manage the team responsible for gathering and analyzing the data used to develop natural language machine learning (e.g.: language models, named entity recognition, information extraction, classification, etc.). But if you are not familiar with the world of computational linguistics, that explanation might sound like a foreign language. So let me take a step back and give a higher-level overview of what linguists like me at Applica do all day, in order to shed some light on this fascinating but often misunderstood field.

Linguistics sits at the intersection of the humanities and the sciences (such as mathematics and computer science), and linguists are accustomed to searching for information across fields and analyzing texts from many disciplines. As the Applica Linguist Team Leader, I have been here since our early days (2013), and it’s truly amazing how much has changed in my role and in the field at large in just these past seven years. Linguistics, like so many other fields of study, is evolving at a rapid pace. As a critical part of the Applica team, we work on datasets (e.g.: collections of documents) that are used in machine learning. Our computer scientists then train models on these datasets for analyzing client data; the models are added to the Applica Robotic Text Automation (RTA) platform and ultimately used on the front end by our clients to scrutinize, comprehend, and process their documents in an automated manner.

The main reason linguists analyze datasets is to ensure that they are suitable for use: in the case of extraction, to see what data/information can be extracted or retrieved, and in the case of classification, to propose a classification scheme.

Datasets can be analyzed using a few different methods:

  • Rule-based approach (when linguists hand-craft rules or a formal grammar; see the sketch after this list)
  • Machine learning (ML) approach (when linguists prepare training and test sets)
  • Hybrid approach (when machine learning is supported by additional rules)
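
To make the first of these concrete, here is a minimal sketch of a rule-based extraction rule written as a single regular expression. The rule, the currency codes, and the sample text are all illustrative assumptions; real rule sets are far larger and are often expressed as formal grammars rather than single patterns.

```python
import re

# A hand-written extraction rule of the kind a linguist might author in a
# rule-based system: a regular expression that captures monetary amounts
# followed by a currency code. (Illustrative only; not a production rule.)
AMOUNT_RULE = re.compile(r"\b(\d{1,3}(?:[ ,]\d{3})*(?:\.\d{2})?)\s?(USD|EUR|PLN)\b")

def extract_amounts(text: str) -> list[tuple[str, str]]:
    """Apply the rule to raw text and return (amount, currency) pairs."""
    return AMOUNT_RULE.findall(text)

print(extract_amounts("The invoice total is 12,500.00 USD, due in 30 days."))
# [('12,500.00', 'USD')]
```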

At the beginning of my tenure at Applica, we mostly used the rule-based approach supported by classic ML techniques, but since 2017 we have mostly used deep learning techniques supported by classic ML. This combination of linguistic approaches with machine learning is ideal for the type of document processing we do at Applica, especially with unstructured documents: linguists provide the necessary language understanding and structure, and ML then leverages this linguistic structure to extract precise information and insights from the datasets.

The ML linguistic approach can actually be broken down further into two subsets:

  • Unsupervised ML – linguists do not need to prepare training or test sets; this approach is limited mostly to clustering, anomaly detection, and language modeling
  • Supervised ML – linguists manually or semi-automatically prepare high-quality training and test sets (aka the gold standard), such as a dataset or linguistic corpus consisting of texts plus tags/annotations (see the sketch after this list)
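
As an illustration, here is a schematic of what a single gold-standard training example might look like: raw text paired with character-offset annotations added by a linguist. The field names and labels below are assumptions for illustration, not Applica’s actual annotation schema.

```python
# One "gold standard" training example: text plus span annotations.
example = {
    "text": "Payment of 12,500.00 USD is due on 2020-03-12.",
    "annotations": [
        {"start": 11, "end": 24, "label": "AMOUNT"},
        {"start": 35, "end": 45, "label": "DATE"},
    ],
}

# Recover each annotated span from its character offsets.
for ann in example["annotations"]:
    span = example["text"][ann["start"]:ann["end"]]
    print(f'{ann["label"]}: {span!r}')
# AMOUNT: '12,500.00 USD'
# DATE: '2020-03-12'
```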

Our linguistics team primarily focuses on extraction and classification – which are typical problems for supervised ML. (This is because ML processes text in a “naïve” manner, e.g.: the phrases “cat catches mouse” and “mouse catches cat” look the same, even though those scenarios are quite different.)
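
The toy example below shows this order-blindness using a bag-of-words representation, one common “naïve” view of text (chosen here purely for illustration; it is not necessarily the representation Applica uses):

```python
from collections import Counter

# Bag-of-words discards word order, so these two opposite sentences
# produce identical representations.
a = Counter("cat catches mouse".split())
b = Counter("mouse catches cat".split())
print(a == b)  # True
```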

  • Our extraction work has recently been focused on a new feature for Applica RTA called NER (named entity recognition) – a subtask of information extraction that locates and classifies named entities mentioned in unstructured text into pre-defined categories, e.g.: names, addresses, dates, amounts, etc. (a toy example using an off-the-shelf library follows this list)
  • And our classification work has been focused on iterating on Applica’s proprietary 2D Contextual Awareness feature – which takes into account both textual and graphical aspects before finalizing classification results, e.g.: LIBOR reference rates, jurisdiction detection for NDA contracts, reason for discontinuance of proceeding/execution in bailiffs’ acts.
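
To give a feel for what NER produces, here is a minimal sketch using spaCy, an open-source library (not Applica RTA’s own model); the sample sentence and the expected entities are illustrative assumptions.

```python
import spacy

# Off-the-shelf NER for illustration only. Assumes the small English model
# has been installed beforehand:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Applica signed the NDA in Warsaw on 12 March 2020 for 12,500.00 USD.")

# Each detected entity carries its text span and a pre-defined category.
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical output includes entries like: Warsaw -> GPE, 12 March 2020 -> DATE
```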

Last but certainly not least, our team of linguists is also involved in the creation of normalization rules – as this is critical to processing documents with high confidence scores. For example, the date on a contract, invoice, or email can be written many different ways depending on the language, who is creating the document, what systems were used to generate the document, etc. Therefore, we convert all these ways of notating a date into one normalized format so that Applica RTA consistently produces accurate results, regardless of the document type uploaded by the user.

An example normalization for various date formats in Polish:
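Below is a minimal sketch, assuming a handful of common Polish notations (a day followed by the genitive month name, and dotted or slashed numeric forms), each mapped to ISO 8601. The formats, function name, and sample inputs are illustrative assumptions, not Applica’s production rules.

```python
import re

# Map Polish genitive month names to month numbers.
POLISH_MONTHS = {
    "stycznia": 1, "lutego": 2, "marca": 3, "kwietnia": 4,
    "maja": 5, "czerwca": 6, "lipca": 7, "sierpnia": 8,
    "września": 9, "października": 10, "listopada": 11, "grudnia": 12,
}

def normalize_date(raw: str) -> str | None:
    """Normalize a few Polish date notations to YYYY-MM-DD."""
    raw = raw.strip().lower()
    # "12 marca 2020": day, genitive month name, year
    m = re.fullmatch(r"(\d{1,2}) (\w+) (\d{4})", raw)
    if m and m.group(2) in POLISH_MONTHS:
        return f"{m.group(3)}-{POLISH_MONTHS[m.group(2)]:02d}-{int(m.group(1)):02d}"
    # "12.03.2020" or "12/03/2020": day, month, year
    m = re.fullmatch(r"(\d{1,2})[./](\d{1,2})[./](\d{4})", raw)
    if m:
        return f"{m.group(3)}-{int(m.group(2)):02d}-{int(m.group(1)):02d}"
    return None  # unrecognized format

for raw in ["12 marca 2020", "12.03.2020", "12/03/2020"]:
    print(raw, "->", normalize_date(raw))
# all three -> 2020-03-12
```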

After seven years in my role here, it turns out that good-quality linguistic annotation is still the key starting point for many subsequent research and product activities.

Hopefully the preceding information sheds a bit more light on the field of computational linguistics and how my colleagues and I are helping to shape the future of text comprehension here at Applica. There is a lot more detail I could get into, but I will save that for another article. Should you have any questions or would like additional information, please feel free to contact me.