Spotlight on stylometric text analysis - The Avon Novels

From MandrakeWiki
Word Cloud - The Avon Novels

The Corpus

The Story of the Phantom

"The Story of the Phantom" is a series of 15 novels based on Lee Falk's Phantom stories, published by Avon Publications in the U.S. from 1972 to 1975. On release, the adaptor of issues 2 and 10 was not credited, and issue 15 was credited to Carson Bingham. Lee Falk corrected this in an "Author's note" page in the books. With that correction, the authors of the novels are:

Issue(s) | Adapted by
1, 6, 9, 12, 15 | Lee Falk
2, 3 | Basil Copper
4, 5, 7, 8, 10, 11 | Ron Goulart (pen name: Frank S. Shawn)
13 | Warren Shanahan
14 | Bruce Cassiday (pen name: Carson Bingham)

Flash Gordon

"Flash Gordon" is a series of 6 novels based on Alex Raymond's Flash Gordon stories, published by Avon Publications in the U.S. from 1974 to 1975. On release, the first four issues were credited to Con Steffanson and the last two to Carson Bingham. Ron Goulart later said that he wrote the first three novels and Bruce Cassiday the last three. On that account, the authors of the novels are:

Issue(s) | Adapted by
1, 2, 3 | Ron Goulart (pen name: Con Steffanson)
4, 5, 6 | Bruce Cassiday (pen name: Carson Bingham)

Supplementary text

The Avon novels include only one novel written by Shanahan and two written by Copper. To provide a better basis for comparison in the further analysis, three novelettes by these authors are added to the corpus:

  • Warren Shanahan (using his pen name: W. J. Saber): "Your Mission- Block the Brenner Pass!" and "Find and Destroy the Nazis Secret Wolf-Pack Base".
  • Basil Copper: "The Long Rest".

The Avon novels have two different protagonists (the Phantom and Flash Gordon) in stories that take place in different environments. To get an indication of whether this can affect the analysis result, four novels by Burroughs are included in the corpus: one with the protagonist Tarzan in a jungle environment, two with the protagonist John Carter on the planet Mars, and one ("At the Earth's Core") set in the underground world of Pellucidar:

  • Edgar Rice Burroughs: "A Princess of Mars", "At the Earth's Core", "Tarzan of the Apes" and "The Warlord of Mars".

Statistics

The Avon Novels

The Avon novels vary between approximately 30,000 and 56,000 words. Lee Falk's first novel has the most words, while the tenth Phantom novel, by Goulart, has the fewest.

  • All of Goulart's novels are in the lower tier of word count, between approximately 30,000 and 37,000.
  • All the novels written by Falk are in the upper tier, between approximately 47,000 and 56,000 words.
  • Cassiday, Shanahan and Copper's novels are approximately 42,000 to 50,000 words.

Supplementary text

The novels by Edgar Rice Burroughs are between approximately 49,000 and 85,000 words, and the novelettes by Copper and Shanahan are approximately 14,000 to 17,000 words.

Analysis using the 'stylo' package in RStudio

Analysis I - The Phantom novels

The corpus consists of the 15 novels, each from chapter 1 to the end of the novel. In this analysis the novels are labelled with the author name printed inside each novel: Bingham, Copper, Falk, Shanahan and Shawn. Three novels in fig. 1 carry special labels: Incorrectly-credited_Avon-15-LF, Not-credited_Avon-02-BC and Not-credited_Avon-10-FSS.

A bootstrap consensus tree analysis for 100 MFW[footnotes 1] to 1,000 MFW (in increments of 50 words) groups the novels by author in line with Lee Falk's correction in the Author's note page.

The novels by Shanahan (13) and Bingham (14) branch close together, which is not uncommon when the corpus contains only one text by an author. To illustrate this I added some novels by Edgar Rice Burroughs: first "Tarzan of the Apes" (fig. 2), then "A Princess of Mars" (fig. 3), and finally "At the Earth's Core" (fig. 4). The Avon novels are relabelled with the author's name (pen name) according to Falk's correction. Once two or more novels by an author are in the corpus, the analysis groups them by author.

Analysis II

Using the corpus with all novels and novellettes from chapter 1 to the end.

  • The first analysis is a cluster analysis for the 100 MFW using the Classic Delta distance. (fig. 5)
  • The second analysis is the same as the first, but with 1,500 MFW. (fig. 6)
  • The third analysis is a Bootstrap consensus tree analysis for the 100 MFW to 3,000 MFW (with an incremental step size of 50 words). (fig. 7)
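
For readers unfamiliar with the Classic Delta distance used in fig. 5 and 6, here is a minimal Python sketch of Burrows's Delta as it is commonly defined: take the most frequent words of the pooled corpus, convert each text's counts to relative frequencies, z-score every word across the texts, and average the absolute z-score differences between two texts. This is not the stylo implementation. The function names, the lowercase whitespace-split tokens and the toy texts are assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean, stdev


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # guard against an empty text
    return [counts[word] / total for word in vocab]


def burrows_delta(texts, n_mfw):
    """Pairwise Delta for `texts` (dict: label -> list of tokens).

    Needs at least two texts. Uses the sample standard deviation,
    which is an assumption of this sketch.
    """
    # 1. Most frequent words (MFW) of the pooled corpus.
    pooled = Counter()
    for tokens in texts.values():
        pooled.update(tokens)
    vocab = [word for word, _ in pooled.most_common(n_mfw)]

    # 2. Relative frequencies per text.
    names = list(texts)
    freqs = {name: relative_freqs(texts[name], vocab) for name in names}

    # 3. z-score each word across all texts.
    z = {name: [] for name in names}
    for i in range(len(vocab)):
        column = [freqs[name][i] for name in names]
        mu, sigma = mean(column), stdev(column)
        for name in names:
            score = (freqs[name][i] - mu) / sigma if sigma > 0 else 0.0
            z[name].append(score)

    # 4. Delta = mean absolute difference of z-scores.
    delta = {}
    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            delta[(a, b)] = mean(abs(x - y) for x, y in zip(z[a], z[b]))
    return delta


# Toy example (made-up sentences, not taken from the corpus).
toy = {
    "a1": "the cat sat on the mat the end".split(),
    "a2": "the cat sat on the mat the end".split(),
    "b1": "a dog ran and a dog hid and ran".split(),
}
toy_delta = burrows_delta(toy, 5)
```

In the toy example the two identical texts get a Delta of zero, and both are further from the third text. A cluster analysis then builds the dendrogram from such a table of pairwise distances.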

In the dendrograms (fig. 5 and 6) the horizontal length of the branches shows the distance between clusters or individual novels, with longer branches indicating greater dissimilarity. For instance, the Burroughs novels cluster together but are quite distinct from other clusters.

Fig. 5: Novels by the same author or within the same series tend to cluster together, indicating similar writing styles or thematic content. Copper, Cassiday, Shanahan, Goulart, Falk, and Burroughs each form distinct clusters, suggesting that each author or series has unique characteristics that distinguish them from others. The Burroughs novels form a separate cluster quite distinct from others, possibly due to the unique genre or style.

Fig. 6: The dendrogram with the 1,500 Most Frequent Words provides a more nuanced clustering of the novels. Novels by the same author or within the same series continue to cluster together, affirming the stylistic or thematic consistency within each author's works. The larger feature set has resulted in tighter clusters, indicating that the novels share even more similarities when more textual features are considered. The Burroughs novels remain distinct from the others, reinforcing their unique stylistic or thematic characteristics. Overall, this analysis highlights the strong internal consistency within each author's works and provides a clearer picture of the relationships between different novels.

The bootstrap consensus tree is a type of hierarchical clustering analysis that shows how stable the clusters are under resampling (bootstrapping). It combines information from multiple bootstrap samples into a consensus view of the clustering structure.
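
The idea of combining many runs into one consensus can be sketched in Python. In the analyses described here, the repeated runs come from varying the number of MFW (for example 100 to 3,000 in steps of 50). The sketch below lists those settings and then counts, across all settings, how often each text has the same nearest neighbour. This nearest-neighbour vote is a deliberately simplified stand-in for a real consensus tree, which merges whole trees. All function names and the toy distances are hypothetical.

```python
from collections import Counter


def mfw_schedule(start, stop, step):
    """MFW settings from `start` to `stop` inclusive."""
    return list(range(start, stop + 1, step))


def nearest_neighbour_votes(distance_tables):
    """Count how often each (text, nearest neighbour) pair occurs.

    `distance_tables` holds one dict per MFW setting, mapping an
    unordered pair (a, b) to the distance between the two texts.
    """
    votes = Counter()
    for table in distance_tables:
        best = {}  # text -> (smallest distance so far, neighbour)
        for (a, b), dist in table.items():
            for text, other in ((a, b), (b, a)):
                if text not in best or dist < best[text][0]:
                    best[text] = (dist, other)
        for text, (_, other) in best.items():
            votes[(text, other)] += 1
    return votes


def consensus_pairs(votes, n_settings, threshold=0.5):
    """Keep pairs supported by at least `threshold` of the settings."""
    return {
        pair: count / n_settings
        for pair, count in votes.items()
        if count / n_settings >= threshold
    }


# Made-up distances for two MFW settings (not results from the corpus).
tables = [
    {("F1", "F2"): 0.5, ("F1", "G1"): 1.0, ("F2", "G1"): 1.2},
    {("F1", "F2"): 0.6, ("F1", "G1"): 0.9, ("F2", "G1"): 0.4},
]
votes = nearest_neighbour_votes(tables)
stable = consensus_pairs(votes, n_settings=len(tables), threshold=0.75)
```

With these toy numbers only the pair (F1, F2) survives the 0.75 threshold, because F1's nearest neighbour is F2 in both settings while the other pairings change from one setting to the next. A grouping that survives across many settings is what the consensus tree reports as stable.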

Fig. 7: Clusters
  • Burroughs: Forms a distinct cluster, indicating a high degree of similarity among these novels.
  • Goulart: Another distinct cluster, showing consistency within this author's works.
  • Cassiday: Forms a cohesive cluster, suggesting similarity in writing style or content.
  • Falk: These novels also cluster together, indicating they share common features.
  • Copper: Forms a tight cluster, showing these novels are very similar.
  • Shanahan: Clusters together, indicating a high degree of similarity among these texts.

Fig. 7: Distances
The radial distance from the center represents the degree of dissimilarity, with novels closer to the center being more similar. Each cluster is relatively distinct from the others, reinforcing the unique characteristics of each author's works.

Conclusion
The bootstrap consensus tree confirms and reinforces the findings from the previous hierarchical clustering analyses (fig. 5 and 6), with some additional insights into the stability of these clusters:

  • Consistency within Authors: The novels by the same author or within the same series consistently cluster together, indicating strong stylistic or thematic coherence.
  • Distinct Clusters: Each author's novels form distinct clusters, suggesting that the differences between authors are significant and consistent across bootstrap samples.
  • Stability of Clusters: The use of bootstrapping to create a consensus tree adds robustness to the analysis, showing that the clusters identified are stable and not artifacts of a specific sample.

Analysis using the Voyant Tools (local Voyant Server)

When it comes to statistics, there may be some sources of error in the corpus.

  • The Avon Novels have some typographical errors that have not been corrected.
  • The proofreading of the OCR[footnotes 2] may introduce some typographical errors.

Punctuation marks

The only authors who use parentheses in this corpus are Shanahan and Falk. Shanahan uses parentheses eight times in his novel. Falk uses them in all his novels, ranging from 15 occurrences (The Phantom #9) to 53 (The Phantom #15).

The number of exclamation marks per novel also distinguishes the various authors:

  • Goulart: 15 - 64
  • Falk: 68 - 87
  • Copper: 110 - 207
  • Shanahan: 141
  • Cassiday: 229 - 370
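
Counts like these can be reproduced outside Voyant Tools with a few lines of Python. The sketch below counts selected punctuation marks in a text and, because the novels differ considerably in length, also converts a raw count to a rate per 10,000 words. The normalisation is an addition for illustration and is not part of the analysis above. The function names, the sample sentence and the numbers in the rate example are made up.

```python
from collections import Counter


def punctuation_profile(text, marks="!()"):
    """Number of occurrences of each mark in `text`."""
    counts = Counter(ch for ch in text if ch in marks)
    return {mark: counts[mark] for mark in marks}


def per_10k_words(count, n_words):
    """Scale a raw count to occurrences per 10,000 words."""
    return 10000 * count / n_words


# Made-up sample text, not taken from the corpus.
sample = "Stop! (Not yet.) Run!"
profile = punctuation_profile(sample)

# Hypothetical numbers: 64 exclamation marks in a 32,000-word novel.
rate = per_10k_words(64, 32000)
```

For a real novel, `text` would be the full contents of the plain-text file and `n_words` the word count reported for that novel.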

Tokenization

The tokenization[footnotes 3] has a bearing on how the number of words is calculated using the Voyant Tools. This analysis uses the automatic tokenization.

Example text: "What's voyant-tools.org?"

Tokenization | Count | Tokens | Notes
Automatic | 3 | what's, voyant, tools.org | the hyphen is split but tools.org is kept as a URL token; tokens are lowercase
Word Boundaries | 5 | what, s, voyant, tools, org | any non-word character is a delimiter; tokens are lowercase
Whitespace Only | 2 | What's, voyant-tools.org? | punctuation is kept in tokens and case is unchanged
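
The three modes can be approximated in Python to show why the word counts differ. The sketch below uses the example string implied by the Whitespace Only row, "What's voyant-tools.org?". The regular expressions are rough approximations chosen to reproduce the rows of the table. They are not the rules Voyant Tools itself applies, and the function names are hypothetical.

```python
import re

# Rough stand-in for the automatic mode: letters and digits,
# optionally joined by an ASCII apostrophe or a dot (assumption).
AUTOMATIC_LIKE = re.compile(r"[a-z0-9]+(?:['.][a-z0-9]+)*")


def whitespace_tokens(text):
    """Split on whitespace only; punctuation and case are kept."""
    return text.split()


def word_boundary_tokens(text):
    """Every non-word character is a delimiter; tokens are lowercase."""
    return re.findall(r"\w+", text.lower())


def automatic_like_tokens(text):
    """Approximation of the automatic mode; tokens are lowercase."""
    return AUTOMATIC_LIKE.findall(text.lower())


example = "What's voyant-tools.org?"
modes = {
    "Automatic (approximation)": automatic_like_tokens(example),
    "Word Boundaries": word_boundary_tokens(example),
    "Whitespace Only": whitespace_tokens(example),
}
for name, tokens in modes.items():
    print(f"{name}: {len(tokens)} tokens {tokens}")
```

The same text gives 3, 5 or 2 tokens depending on the mode, so word counts are only comparable when they were produced with the same tokenization.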

Notes

  1. MFW = Most Frequent Words
  2. OCR = Optical Character Recognition - the process that converts an image of text into a machine-readable text format
  3. Tokenization = The process of identifying words, or sequences of Unicode letter characters that should be considered as a unit