
Every word you write may become evidence in court

In the last century, a mysterious bomber appeared in the United States.

Over 17 years, from 1978 until his arrest in 1996, he sent 16 bombs that killed 3 people and injured 23 others. During this period the FBI assigned more than 500 agents to the case and spent millions of dollars, yet the “Unabomber” remained at large.

He appeared to be a highly intelligent and cautious criminal: he chose his targets at random and left almost no traceable evidence at the scene, such as fingerprints, hair, or fibers. Even the bomb-making materials could not be traced to a place of purchase; the wood he used, for example, looked like scrap picked up from the curb.

The clue that finally solved the case turned out to be the bomber's writing style.

In 1995, the FBI received a letter. The sender claimed responsibility for the bombs and demanded the publication of his paper, “Industrial Society and Its Future”, promising in return to stop the bombings. He claimed that the paper would explain his motives and his views on the ills of society.

After debating whether this would amount to giving in to a terrorist, the FBI decided to release the paper in the hope that someone would recognize the author.

The paper argues that modern technology and industrialization have seriously eroded human society, and that someone must therefore stand up and halt technological progress in order to save humanity. These extreme claims sparked widespread debate, and many extremists and anarchists came to regard the bomber as a “hero”; at the same time, the paper drew the attention of the general public.

Soon afterwards, a member of the public named David Kaczynski contacted the police through a lawyer. He said that the views and writing style of the paper closely resembled those of his brother, Ted Kaczynski, and he provided some of Ted's old letters and essays.

FBI experts ran a linguistic analysis of this material. Beyond the shared argument that technology is to blame, they found many stylistic features that matched the published paper, including formatting, punctuation, and idiosyncratic spelling (even before the paper was released, the FBI had noticed that it used the British spelling “analyse”). Still, that was not enough evidence for them to obtain a search warrant.

The key evidence came from one of the letters, which contained the sentence “you can’t eat your cake and have it too”, an unusual reversal of the common idiom that also appears in paragraph 185 of the bomber’s paper. With this breakthrough, the police finally tracked down and arrested the bomber himself, Ted Kaczynski, in a remote log cabin in Montana, USA.

Ted Kaczynski reportedly had an IQ of 167. He entered Harvard University to study mathematics at the age of 16, and at 25 became the youngest assistant professor in the history of the mathematics department at the University of California, Berkeley. In the end, this frighteningly intelligent, antisocial bomber was exposed by his own writing style.

“You have to take a taxi to get to your destination”

In criminal investigation work, there are many ways to trace a person’s identity, such as fingerprints, irises, DNA, etc., all of which are unique identification marks.

In fact, language and writing style can also be used to confirm identity; the bomber above, for example, was recognized by his brother through his writing style. As an FBI investigator who worked on the case once put it: “No two people write alike.”

When people write or type, they settle into particular habits of word use. These small clues act like fingerprints on a text and let us tell who wrote it. The technique of treating the stylistic features of a text as a “fingerprint” for identifying its author is called “author identification” (also known as authorship attribution). It belongs to a field called “forensic linguistics”, which studies written and spoken language in order to establish the identity of suspects or victims in legal cases.

In 2018, the New York Times published an anonymous op-ed, “I Am Part of the Resistance Inside the Trump Administration”. The author claimed to work in the White House and criticized the state of American politics at the time. This infuriated then-President Trump, who vowed to root out the “mole”. This is where author identification comes in handy: someone noticed that the word “lodestar” appeared in the anonymous piece, a word that then-Vice President Pence was especially fond of using. Pence, of course, was quick to deny it.

There are many schools and techniques of author identification. For example, era-specific language can reveal when a text was written (Chinese internet catchphrases such as “Are you a GG or an MM?”, “Your mom is calling you home for dinner”, and “It’s all just passing clouds” were once everywhere and are now nostalgic relics of their time). The use of certain vocabulary can also reflect the author’s occupation (people who constantly say “closed loop”, “leverage point”, and “empowerment” most likely work in the internet industry).

Most of these approaches rely on content words. Content words, such as nouns, verbs, and adjectives, carry specific meaning. But analysis based on content words runs into a problem: an author uses different sets of content words for different subject matter. Take a biologist who writes romance novels on the side. At work he uses plenty of biological terms; writing fiction late at night, he may use plenty of racy vocabulary. Judgments of authorship based on content words are therefore easily thrown off by such shifts between genres.

Compared with content words, function words such as adverbs, prepositions, and conjunctions usually carry little meaning of their own. Even across texts on different topics, an author's frequency of function-word use stays roughly the same. According to one count, the character “的” (de) appears about 0.45 times per 10 characters in Chinese text, making it one of the most frequently used Chinese characters. Similarly, the frequencies of function words such as “地”, “得”, “你”, and “味” are hardly affected by the content of a text and therefore better reflect the author’s writing habits.
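The counting described above can be sketched in a few lines of Python. This is only an illustration: the function name `marker_rates`, the sample sentence, and the choice to ignore whitespace and punctuation are assumptions made here, not details taken from any study mentioned in this article.

```python
import unicodedata
from collections import Counter


def marker_rates(text, markers, per=10):
    """Occurrences of each marker character per `per` characters of text.

    Whitespace and punctuation are ignored (an assumption of this sketch).
    """
    chars = [
        c for c in text
        if not c.isspace() and not unicodedata.category(c).startswith("P")
    ]
    total = len(chars)
    if total == 0:
        return {m: 0.0 for m in markers}
    counts = Counter(chars)
    return {m: per * counts[m] / total for m in markers}


# Invented sample sentence, used only to show the call.
sample = "我的书在桌子上，你的笔在我的包里。"
rates = marker_rates(sample, ["的", "在"])
print(rates)
```

A real analysis would run this over whole chapters or books, since rates computed on a single sentence are far too noisy to say anything about an author.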

In China, the best-known case of author identification is the long-running puzzle of who wrote the last 40 chapters of “A Dream of Red Mansions”. The complete novel has 120 chapters. It is generally accepted that Cao Xueqin wrote the first 80 chapters and Gao E the last 40. In 1970, Zhao Gang, a scholar of the novel, studied its authorship using the frequencies of five words, “de”, “le”, “zai”, “er”, and “zhu”, and concluded that the first 80 chapters and the last 40 chapters were indeed written by different people.

Five words were used in this study, three of which are function words.
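The article does not say which statistical test the study used. One common way to compare marker-word counts between two groups of chapters is Pearson's chi-square statistic, sketched below as a stand-in technique. The function name and the counts in the example are invented for illustration and are not data from the novel.

```python
def chi_square_statistic(counts_a, counts_b):
    """Pearson chi-square statistic for two rows of category counts.

    Each row lists how often every marker (plus an "everything else"
    bucket) occurs in one group of chapters. A larger value means the
    two groups use the markers in more different proportions.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each group needs at least one observation")
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(counts_a, counts_b):
        column = a + b
        if column == 0:
            continue  # marker never occurs in either group
        expected_a = total_a * column / grand
        expected_b = total_b * column / grand
        stat += (a - expected_a) ** 2 / expected_a
        stat += (b - expected_b) ** 2 / expected_b
    return stat


# Invented counts: [marker occurrences, all other words] per group.
group_one = [10, 90]
group_two = [30, 70]
print(chi_square_statistic(group_one, group_two))
```

To turn the statistic into a verdict, it would be compared against a chi-square distribution with (number of categories - 1) degrees of freedom; that step is left out here to keep the sketch dependency-free.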

Using Algorithms to Prove “You Are You” in the Literary Circle

Like Cao Xueqin, a few great writers abroad have also had their works re-examined by later scholars, the British playwright Shakespeare among them.

Many literary researchers believe that some of Shakespeare’s works were actually written by others, including the masterpiece “Henry VIII”.

“Henry VIII” is a late work of Shakespeare’s. In his later years Shakespeare was the house playwright of the King’s Men, and after his death John Fletcher took over the position. Some people therefore suspect, not unreasonably, that Fletcher continued the writing of “Henry VIII” or even revised it.

In 1850, the literary critic James Spedding offered some evidence: the text of “Henry VIII” sometimes uses “ye” instead of “you” and “’em” instead of “them”, both of which were Fletcher’s writing habits.

Of course, these claims have remained controversial. And even if the two were co-authors, it was impossible to judge how much each of them contributed to the play.

However, with the advancement of technology, especially the maturity of machine learning algorithms, some people want to use new methods to solve the mystery of “Henry VIII”.

In 2019, a researcher named Petr Plecháč announced that he had an answer. Plecháč, a scholar at the Czech Academy of Sciences, used machine learning to examine the play line by line and let the machine decide who wrote each part.

To train the algorithm and obtain a well-tuned model, Plecháč first gathered other Shakespeare plays from the same period as “Henry VIII”, including “The Winter’s Tale” and “The Tempest”. He converted them into rows of data and fed them to the algorithm so that the program could learn to recognize Shakespeare’s word choices and sentence patterns. He likewise collected many plays written by Fletcher for the algorithm to learn from. Once trained, the algorithm could act as a referee and decide who wrote each part of “Henry VIII”.
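The train-then-judge workflow can be illustrated with a toy classifier. The sketch below uses a small multinomial naive Bayes model over word tokens, which is a stand-in chosen for brevity and not a reconstruction of the model used in the study. The class name, the labels `author_a` and `author_b`, and the training lines are all invented; the lines merely echo the “you/them” versus “ye/’em” contrast mentioned earlier.

```python
import math
from collections import Counter


class NaiveBayesAuthor:
    """Multinomial naive Bayes over word tokens with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {}            # label -> Counter of tokens
        self.doc_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        n_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of every token.
            score = math.log(self.doc_counts[label] / n_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training lines for two hypothetical authors.
train_texts = [
    "do you hear them now", "you know them well",
    "do ye hear 'em now", "ye know 'em well",
]
train_labels = ["author_a", "author_a", "author_b", "author_b"]

model = NaiveBayesAuthor().fit(train_texts, train_labels)
print(model.predict("will ye tell 'em"))
```

With real plays, the same idea would be applied to a sliding window of lines rather than to single sentences, so that a change of author shows up as a change in the predicted label along the text.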

The machine learning analysis supported Spedding’s conjecture that Fletcher was indeed involved in writing “Henry VIII”. Moreover, according to the algorithm, Fletcher’s contribution was substantial: about half of the play was his. The algorithm even pinpointed which passages were written by Shakespeare himself and which by Fletcher. In its analysis of Act II, Scene III, for example, it attributed the text up to line 1261 to Shakespeare and lines 1261 to 1299 to Fletcher, before switching back to Shakespeare.

Of course, the truth has long been buried by history, and all modern scholars can do is make reasonable, probability-based guesses. For famous writers, even if ghostwriters had a hand in some works, their literary standing remains unshaken.

Some writers, however, face an even more awkward situation: later generations suspect that none of their works were written by them at all. The French playwright Molière suffered exactly this kind of total denial.

Molière, who wrote masterpieces such as “The Miser” and “Tartuffe”, holds almost the same place in the hearts of the French as Shakespeare does in the hearts of the British.

Yet hundreds of years later, some people began to suspect that Molière had not written his plays and was in fact a fraud. Their reasons: first, historical records show that Molière was a well-known actor who spent almost his entire life performing and touring, so how could he have had time to write plays? Second, no original manuscript signed by Molière has ever been found.

The skeptics also named several possible ghostwriters, the most prominent being the playwright Pierre Corneille. Some even spun a whole ghostwriting story around this: the well-educated Corneille wrote the plays and then put Molière’s name on them, exploiting Molière’s star power to make them more popular.

The parties to this ghostwriting controversy are long dead and buried and cannot come forward to testify. So the detective work was once again handed to the machine.


In 2019, two French scholars published a paper in the academic journal Science Advances entitled “Why Molière most likely did write his plays”.

Just from the title, you can tell that this research must be very rigorous.

The researchers collected the works of Molière, Corneille, and 10 other writers of the period, fed them into a computer program, and counted how often each author used each function word. For greater accuracy, they also analyzed vocabulary, affixes, grammar, and other features, and finally extracted a stylistic profile for each author.

After massive data collection, complex statistical analysis, and optimized machine learning algorithms, the two French scholars concluded the paper with satisfaction:

“These conclusions strongly substantiate the idea that Molière indeed wrote his own plays.”
