Spotlight on stylometric text analysis - Newspaper stories

From MandrakeWiki

Phantom Sundays

1939-1946

The Corpus

The Phantom Sundays
  • PS_002 "The Precious Cargo of Colonel Winn"
  • PS_003 "The Fire Goddess"
  • PS_004 "The Beachcomber"
  • PS_005 "The Saboteurs"
  • PS_006 "The Return of the Sky Band"
  • PS_007 "The Impostor"
  • PS_008 The Marshall Sisters Pt.1: "Castle in the Clouds"
  • PS_009 The Marshall Sisters Pt.2: "The Ismani Cannibals"
  • PS_010 The Marshall Sisters Pt.3: "Hamid the Terrible"
  • PS_011 "The Childhood of the Phantom"
  • PS_012 "The Golden Princess"
  • PS_013 "The Strange Fisherman"
  • PS_014 "Queen Pera the Perfect"
Mandrake the Magician Sundays
  • MS_012 "The Ghost Bear of Glass Mountain"
  • MS_013 "The Theatre Mysteries"
  • MS_023 "Mystery of the Girls with Red Hair"
  • MS_024 "Cloud City"
  • MS_025 "Gloria Golden"
  • MS_026 "The Garden of Wuzzu"
  • MS_027 "The Circus Adventure"
  • MS_028 "The Santa Claus Pirates"
Comics written by Alfred Bester
  • "Starman", The Menace of the Invisible Raiders! (Adventure Comics #67, October 1941)
  • "Starman", The Blaze of Doom! (Adventure Comics #68, November 1941)
  • "Starman", The Little Man Who Wasn't There! (Adventure Comics #78, September 1942)
  • "Green Lantern", The Man Who Wanted the World! (Green Lantern #10, winter 1943)
Comics written by Gardner Fox
  • "Starman", The Mystery of the Undersea Terror! (Adventure Comics #65, August 1941)
  • "Starman", The Case of the Camera Curse! (Adventure Comics #66, September 1941)

Voyant Tools

An analysis using Voyant Tools shows that the corpus consists of 82,678 total words and 7,251 unique word forms.

In this corpus, four of the five highest vocabulary densities belong to Bester's stories. Vocabulary density is the ratio of unique word forms to total words in a document. (Its inverse indicates how many words are read on average before a new word form is encountered.)

Rank     1                   2               3               4               5
Highest  GL-10-1943 (0.405)  PS_009 (0.387)  AC_078 (0.380)  AC_067 (0.371)  AC_068 (0.365)
Lowest   PS_006 (0.198)      PS_010 (0.206)  PS_011 (0.223)  PS_007 (0.236)  PS_004 (0.236)
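
As a rough illustration of the measure, vocabulary density can be sketched in a few lines of Python. This is a minimal sketch, not Voyant's implementation: the tokenizer (lowercased runs of letters, digits and apostrophes) and the function names are assumptions, so the numbers will not match Voyant's exactly.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a token is a run of letters, digits or apostrophes.
    Voyant's own tokenizer may split text differently.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def vocabulary_density(text):
    """Return unique word forms divided by total words (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sample = "The Phantom rides at night. The Phantom never sleeps."
print(f"{vocabulary_density(sample):.3f}")  # prints 0.778 (7 unique forms / 9 words)
```

Applied to the corpus totals quoted above, 7,251 / 82,678 gives about 0.088. The per-story values are much higher because shorter texts repeat fewer words.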

Comparing the readability indices, the childhood story (PS_011) has the lowest score. A readability index estimates how difficult a text is to read by measuring its complexity through attributes such as word length, sentence length and syllable count. Voyant Tools uses the Coleman–Liau index, which relies on letters per word and sentences per word rather than syllables, and its output approximates the U.S. grade level.

Rank     1               2               3               4               5
Highest  MS_028 (7.863)  MS_026 (7.730)  MS_024 (7.378)  MS_027 (7.360)  MS_025 (7.222)
Lowest   PS_011 (3.547)  PS_004 (4.493)  MS_012 (4.668)  MS_013 (4.767)  PS_014 (4.836)
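
The Coleman–Liau index is commonly given as CLI = 0.0588 L - 0.296 S - 15.8, where L is the average number of letters per 100 words and S is the average number of sentences per 100 words. The sketch below applies that formula in Python. The word and sentence detection rules are simplifying assumptions (a sentence ends at any run of ".", "!" or "?"), so the scores will differ somewhat from those Voyant reports.

```python
import re


def coleman_liau(text):
    """Return the Coleman-Liau index of a text (0.0 if it contains no words)."""
    # Assumption: a word is a run of letters, optionally with inner apostrophes.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    if not words:
        return 0.0
    letters = sum(ch.isalpha() for word in words for ch in word)
    # Assumption: every run of ., ! or ? closes one sentence; at least one sentence.
    sentences = len(re.findall(r"[.!?]+", text)) or 1
    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = sentences / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


short_bubbles = "Hello! Who are you? Run!"
long_prose = "Extraordinary circumstances necessitate considerable deliberation."
print(coleman_liau(short_bubbles) < coleman_liau(long_prose))
```

Short exclamations of the kind found in speech bubbles raise the sentence count and lower the letter count, which pulls the index down.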

JGAAP

Using the Java Graphical Authorship Attribution Program (JGAAP), we can compare an unknown text against known texts.

We add the texts from the corpus (except the childhood story) under two authors: Bester and Falk. For an analysis where the childhood story is the unknown text, we choose the event driver Character NGrams (n: 10), the event culling Most Common Events (numevents: 50) and the analysis method Nearest Neighbor Driver with the metric Alt Intersection Distance.

The analysis ranks the 24 known texts from most to least similar to the childhood story. The ranking suggests that this story was most likely written by the same author as the texts at the top of the list: PS_003, PS_012, PS_002, PS_004, PS_007, PS_013, PS_008, PS_010, PS_014, PS_006, PS_005, PS_009, MS_027, MS_024, AC_067, MS_026, MS_028, MS_013, MS_023, MS_012, AC_078, MS_025, AC_068, GL-10.

The same analysis can be run with Bester's Green Lantern story as the unknown text. Here Bester's three Starman stories come first: AC_067, AC_068, AC_078, MS_013, MS_023, MS_024, MS_027, MS_028, PS_002, PS_003, PS_004, PS_006, PS_007, PS_010, PS_012, PS_014, PS_011, MS_025, MS_026, PS_005, PS_008, PS_013, MS_012, PS_009.
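
The general shape of such a pipeline (character n-grams, culling to the most common events, then ranking known texts by distance to the unknown text) can be sketched in Python. This is only an illustration under stated assumptions and not JGAAP's code: the function names are hypothetical, culling is done over the pooled n-grams of all texts, and a plain Jaccard distance stands in for JGAAP's Alt Intersection Distance, whose exact definition is not reproduced here.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def most_common_events(texts, n, num_events):
    """Pool the n-grams of all texts and keep the num_events most frequent."""
    pooled = Counter()
    for text in texts:
        pooled.update(char_ngrams(text, n))
    return {gram for gram, _ in pooled.most_common(num_events)}


def overlap_distance(a, b):
    """Stand-in metric (assumption): Jaccard distance, 1 - |A & B| / |A | B|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def rank_neighbors(known, unknown, n=10, num_events=50):
    """Rank the labels of known texts from nearest to farthest from unknown."""
    kept = most_common_events(list(known.values()) + [unknown], n, num_events)

    def events(text):
        # Event set of one text after culling.
        return set(char_ngrams(text, n)) & kept

    target = events(unknown)
    scored = [(overlap_distance(events(text), target), label)
              for label, text in known.items()]
    return [label for _, label in sorted(scored)]


# Toy texts; n is lowered to 3 because the samples are very short.
known = {
    "A": "the ghost who walks the jungle",
    "B": "zzzz qqqq xxxx zzzz qqqq",
}
print(rank_neighbors(known, "the ghost who walks alone", n=3))
```

With real stories, the output would be a ranked list of labels comparable in form to the lists above.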

'stylo' package in RStudio

A stylometric text analysis of the Avon Novels using the 'stylo' package in RStudio shows that it is possible to use stylometry to identify the author of a book. The text in comic strips differs somewhat from that of a novel. In the novels, dialogue looks like this: "Hello," said the Phantom. In the Sunday stories, the speech bubbles read more like this: Hello!

Two analyses of the Phantom Sundays were done: the first with 0-902 MFW (most frequent words) 2-grams (fig. 1) and the second with 0-902 MFC (most frequent characters) 3-grams (fig. 2), both using the Bootstrap Consensus Tree. The MFW 2-gram result groups all stories close together in writing style. The MFC 3-gram result shows slightly larger variation, but this is most likely due to the amount of dialogue from different characters in the stories. Neither analysis gives a clear indication that anyone other than Lee Falk was the author of these stories.

A third analysis added some Mandrake Sundays to the corpus. This is a cluster analysis using the 100 MFW (fig. 3). Here one sees a greater distance between the Phantom stories and the Mandrake stories, but otherwise no indication of a ghost-writer.

A fourth analysis used the Phantom and Mandrake Sundays and the four comics by Bester. It is a Bootstrap Consensus Tree using 0-902 MFW with classic delta distance. In this analysis (fig. 4) the stories written by Bester branch off, but the childhood story sits among the others written by Lee Falk.
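
Classic delta is usually understood as Burrows's Delta: the relative frequencies of the most frequent words are converted to z-scores across the corpus, and the distance between two texts is the mean absolute difference of their z-scores. The Python sketch below illustrates that idea. It is not the 'stylo' implementation: the tokenizer, the use of the population standard deviation and the function names are assumptions, and it does not build a consensus tree.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def burrows_delta(texts, mfw=100):
    """Return {(label_a, label_b): delta} for every pair of texts."""
    labels = list(texts)
    tokens = {label: tokenize(texts[label]) for label in labels}

    # Pick the mfw most frequent words of the pooled corpus as features.
    pooled = Counter()
    for words in tokens.values():
        pooled.update(words)
    features = [word for word, _ in pooled.most_common(mfw)]

    # Relative frequency of every word in each text.
    freqs = {}
    for label in labels:
        counts = Counter(tokens[label])
        total = len(tokens[label])
        freqs[label] = {word: count / total for word, count in counts.items()}

    # z-score each feature across the corpus; skip features that never vary.
    z = {label: [] for label in labels}
    for word in features:
        column = [freqs[label].get(word, 0.0) for label in labels]
        sd = pstdev(column)
        if sd == 0:
            continue
        mu = mean(column)
        for label, value in zip(labels, column):
            z[label].append((value - mu) / sd)

    deltas = {}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            diffs = [abs(x - y) for x, y in zip(z[a], z[b])]
            deltas[(a, b)] = mean(diffs) if diffs else 0.0
    return deltas


toy = {
    "a": "the cat and the dog and the bird",
    "b": "the cat and the dog and the fish",
    "c": "a a a a storm of of of swords",
}
for pair, delta in burrows_delta(toy, mfw=10).items():
    print(pair, round(delta, 3))
```

A smaller delta means a closer match in word usage, so texts by the same writer are expected to lie nearer to each other than to texts by another writer.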

A fifth analysis (fig. 5) is a Principal Components Analysis using the 600 MFC 2-grams. For this analysis, two Starman stories written by Gardner Fox were added to the corpus. Even here, the childhood story (PS_011) does not stand out from the other Sunday stories in the corpus.