Spotlight on stylometric text analysis: Difference between revisions

From MandrakeWiki
mNo edit summary
Line 1: Line 1:
==Stylometry==
Different authors have different writing styles, which show in features such as the length of words and sentences, the frequencies of words and word forms, the richness of vocabulary, the use of punctuation, and so on. An author can also have preferences for certain spelling variants or certain expressions.
Different authors have different writing styles, which show in features such as the length of words and sentences, the frequencies of words and word forms, the richness of vocabulary, the use of punctuation, and so on. An author can also have preferences for certain spelling variants or certain expressions.


Stylometry is the study of measurable features of style.  
Stylometry is the study of measurable features of style.  
 
===
==The Story of the Phantom==
===The Story of the Phantom===
The Story of the Phantom is a series of 15 novels, published by Avon Publications in the U.S. from 1972 to 1975, based on Lee Falk's Phantom stories.  
The Story of the Phantom is a series of 15 novels, published by Avon Publications in the U.S. from 1972 to 1975, based on Lee Falk's Phantom stories. When released, the adaptor of issues 2 and 10 was not credited, and issue 15 was credited to Carson Bingham. Lee Falk corrected this in an ''"Author's note"'' in the books.
 
{| {{table}}  
{| {{table}}  
!Adapted by !!issues !!note  
!Adapted by !!issues !!note  
Line 20: Line 20:
|-
|-
|}
|}
 
====Analysis====
When released, the adaptor of issues 2 and 10 was not credited, and issue 15 was credited to Carson Bingham. Lee Falk corrected this in an ''"Author's note"'' in the books.
 
===Analysis===
A stylometric analysis was carried out to see whether Lee Falk's correction in the ''"Author's note"'' can be confirmed.
A stylometric analysis was carried out to see whether Lee Falk's correction in the ''"Author's note"'' can be confirmed.
*JGAAP = Java Graphical Authorship Attribution Program
*MFW - most frequent words
*MFW - most frequent words
*MFC - most frequent characters
*MFC - most frequent characters
*n-Grams - Sample for character 2-grams: The Phantom said = th,he,e , p,ph,ha,an,nt, etc. Sample for word 2-grams: Hello, the Phantom said = hello the,the phantom,phantom said, etc.  
*n-Grams - Sample for character 2-grams: The Phantom said = th,he,e , p,ph,ha,an,nt, etc. Sample for word 2-grams: Hello, the Phantom said = hello the,the phantom,phantom said, etc.  
*Corpus - collection of text. Here the 15 novels, from chapter 1 to the end of the novel.
*Corpus - collection of text. Here the 15 novels, from chapter 1 to the end of the novel.
====JGAAP====
=====Using JGAAP=====
The novels were prepared by assigning each one to its known author, leaving issues 2, 10 and 15 as unknown authors.
The novels were prepared by assigning each one to its known author, leaving issues 2, 10 and 15 as unknown authors.
Using character 4-grams and the nearest neighbor driver with the Cosine Distance metric, issues 2, 10 and 15 were compared to the known authors.
Using character 4-grams and the nearest neighbor driver with the Cosine Distance metric, issues 2, 10 and 15 were compared to the known authors.


The result was that the most likely author of #2 is Basil Copper, of #10 Frank S. Shawn and of #15 Lee Falk.
The result was that the most likely author of #2 is Basil Copper, of #10 Frank S. Shawn and of #15 Lee Falk.
====RStudio====
====Using R====
The novels were put into one corpus folder.  
The novels were put into one corpus folder.  
Two analyses were done: the first with 100-1000 MFW 2-grams and the second with 100-1000 MFC 4-grams, both using the Bootstrap Consensus Tree.
Two analyses were done: the first with 0-1000 MFW 2-grams and the second with 0-1000 MFC 3-grams, both using the Bootstrap Consensus Tree.


The results group the novels according to the table above, confirming Lee Falk's correction in the ''"Author's note"''.
The results group the novels according to the table above, ''confirming'' Lee Falk's correction in the ''"Author's note"''.
<gallery>
<gallery>
Image:RStudio-Avon-01.jpg|''MFW 2-gram''
Image:RStudio-Avon-01.jpg|''MFW 2-gram''
Image:RStudio-Avon-02.jpg|''MFC 4-gram''
Image:RStudio-Avon-02.jpg|''MFC 4-gram''
</gallery>
</gallery>
Interestingly, the analysis groups issues 13 and 14 together as statistically similar in style.
Interestingly, the analysis groups issues 13 and 14 together as statistically similar in style.
 
==Short stories and a one-act drama==
===Analysis===
In this analysis the corpus consists of 5 texts by Lee Falk: two short stories, "[[Spotlight on Lee Falk - Other writings - The Picture Man|The Picture Man]]" ''(1937)'' and "[[Spotlight on Lee Falk - Other writings - Time is Money|Time is Money]]" ''(1975)'', a one-act drama, [[Spotlight on Lee Falk - Other writings - Eris|Eris]] ''(1966)'', and two of the Phantom novels, issue 1 ''(1972)'' and issue 15 ''(1975)''.
In addition, there are 5 texts by Ron Goulart: three short stories, "Shandy" ''(1958)'', "Ignatz" ''(1960)'' and "Subject to Change" ''(1960)'', and two Phantom novels (written under the pen name Frank S. Shawn in this series), issue 4 ''(1973)'' and issue 11 ''(1974)''.
 
The corpus is a mix of different genres: 4 novels with the Phantom as the main character, 4 science fiction short stories, one mystery/science fiction short story and one drama.
 
====RStudio====
The texts were put into one corpus folder. Two analyses were done: the first with 100-1000 MFW 2-grams and the second with 100-1000 MFC 4-grams, both as Cluster Analysis using Cosine Delta.
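As an illustration of the distance measure named above, here is a minimal Python sketch of Cosine Delta as it is commonly described: feature frequencies are z-scored across the corpus and texts are then compared by cosine distance. It is not the code used for this analysis. The function names, the toy frequency table and the use of the population standard deviation are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def z_score_table(freqs):
    """Z-score each feature column across all texts.

    freqs maps a text name to a list of relative frequencies,
    with the same feature order in every list.
    """
    names = list(freqs)
    n_features = len(freqs[names[0]])
    columns = [[freqs[name][j] for name in names] for j in range(n_features)]
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]  # assumption: population sd
    return {
        name: [
            (freqs[name][j] - mus[j]) / sds[j] if sds[j] > 0 else 0.0
            for j in range(n_features)
        ]
        for name in names
    }


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Toy numbers for illustration only, not measured from any of the texts.
toy_freqs = {
    "text_a1": [0.050, 0.010, 0.020],
    "text_a2": [0.050, 0.012, 0.021],
    "text_b1": [0.030, 0.030, 0.010],
    "text_b2": [0.031, 0.029, 0.011],
}

z = z_score_table(toy_freqs)
print(cosine_distance(z["text_a1"], z["text_a2"]))
print(cosine_distance(z["text_a1"], z["text_b1"]))
```

A cluster analysis would then build a tree from the full matrix of such pairwise distances.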
 
The MFW analysis groups the shorter texts by their respective authors. The novels are also grouped correctly by author, but separately from the shorter texts. This may be due to the different lengths of the texts, or because the novels share a ''Phantom genre style''.
 
The MFC analysis groups Ron Goulart's short stories and novels correctly, but wrongly includes both "Eris" and "The Picture Man" by Lee Falk in that group. Interestingly, Lee Falk's short story "Time is Money" is grouped with his two Phantom novels.
 
<gallery>
Image:RStudio-Short-01.jpg|''MFW 2-gram''
Image:RStudio-Short-02.jpg|''MFC 4-gram''
</gallery>


==A short story, a one-act drama and four Sunday stories==
===Analysis===
In this analysis the corpus consists of 6 texts by Lee Falk: the short story "[[Spotlight on Lee Falk - Other writings - The Picture Man|The Picture Man]]" ''(1937)'', the one-act drama [[Spotlight on Lee Falk - Other writings - Eris|Eris]] ''(1966)'', two Phantom Sunday stories, "The Beachcomber" ''(1940)'' and "The Childhood of the Phantom" ''(1944-1945)'', and two Mandrake Sunday stories, "[[The Ghost Bear of Glass Mountain]]" ''(1939)'' and "[[The Theatre Mysteries]]" ''(1940)''.


The text in the comic strips is a bit different. In a short story, dialogue looks like this: "Hello," said the Phantom. In the Sunday stories, a speech bubble reads more like this: Hello! To record who is saying what, the Sunday stories were prepared like this: Phantom: Hallo!, Narrative: The Phantom smiling and... etc. These added speaker labels may affect the results of the MFC n-gram analysis.
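One way to check how much the added labels matter would be to run the character n-gram analysis a second time on a copy of the text with the labels removed. The Python sketch below is a hypothetical helper, not part of the analysis described here. It assumes that each prepared line starts with a label made of letters and spaces followed by a colon, as in the sample above.

```python
import re

# Assumption: a label is letters/spaces at the start of a line, then a colon.
LABEL_PATTERN = re.compile(r"^[ \t]*[A-Za-z][A-Za-z ]*:[ \t]*", re.MULTILINE)


def strip_speaker_labels(text):
    """Remove a leading 'Speaker:' label from every line of the text."""
    return LABEL_PATTERN.sub("", text)


sample = "Phantom: Hallo!\nNarrative: The Phantom smiling"
print(strip_speaker_labels(sample))
```

Note that this simple pattern would also strip the start of a genuine line such as "Listen: ...", so the output should be checked by hand.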




[[Category: Spotlight on|Stylometry]]
[[Category: Spotlight on|Stylometry]]

Revision as of 19:41, 10 October 2020

Stylometry

Different authors have different writing styles, which show in features such as the length of words and sentences, the frequencies of words and word forms, the richness of vocabulary, the use of punctuation, and so on. An author can also have preferences for certain spelling variants or certain expressions.

Stylometry is the study of measurable features of style.

The Story of the Phantom

The Story of the Phantom is a series of 15 novels, published by Avon Publications in the U.S. from 1972 to 1975, based on Lee Falk's Phantom stories. When released, the adaptor of issues 2 and 10 was not credited, and issue 15 was credited to Carson Bingham. Lee Falk corrected this in an "Author's note" in the books.

{| {{table}}
!Adapted by !!issues !!note
|-
|Basil Copper ||2, 3 ||#2 The adaptor is not credited
|-
|Carson Bingham ||14 ||
|-
|Frank S. Shawn ||4, 5, 7, 8, 10, 11 ||#10 The adaptor is not credited
|-
|Lee Falk ||1, 6, 9, 12, 15 ||#15 is wrongly credited as Carson Bingham
|-
|Warren Shanahan ||13 ||
|}

Analysis

A stylometric analysis was carried out to see whether Lee Falk's correction in the "Author's note" can be confirmed.

  • MFW - most frequent words
  • MFC - most frequent characters
  • n-Grams - Sample for character 2-grams: The Phantom said = th,he,e , p,ph,ha,an,nt, etc. Sample for word 2-grams: Hello, the Phantom said = hello the,the phantom,phantom said, etc.
  • Corpus - collection of text. Here the 15 novels, from chapter 1 to the end of the novel.
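The n-gram samples in the list above can be reproduced with a short Python sketch. The function names are made up for this illustration, and the tokenization (lowercasing, dropping punctuation for word n-grams) is an assumption chosen to match the samples, not necessarily what JGAAP or R does internally.

```python
import re


def char_ngrams(text, n):
    """Return the overlapping character n-grams of the lowercased text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def word_ngrams(text, n):
    """Return overlapping word n-grams after lowercasing and dropping punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(char_ngrams("The Phantom said", 2)[:8])
# ['th', 'he', 'e ', ' p', 'ph', 'ha', 'an', 'nt']
print(word_ngrams("Hello, the Phantom said", 2))
# ['hello the', 'the phantom', 'phantom said']
```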
Using JGAAP

The novels were prepared by assigning each one to its known author, leaving issues 2, 10 and 15 as unknown authors. Using character 4-grams and the nearest neighbor driver with the Cosine Distance metric, issues 2, 10 and 15 were compared to the known authors.
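The idea behind this setup can be sketched in Python as follows. This is not JGAAP's implementation: the function names, the whitespace normalization, and the choice to compare the unknown text against each known text separately are assumptions for the illustration, and the toy sentences are invented placeholders, not text from the novels.

```python
import math
from collections import Counter


def profile(text, n=4):
    """Relative frequencies of the character n-grams in a text."""
    s = " ".join(text.lower().split())  # lowercase, collapse whitespace
    grams = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()} if total else {}


def cosine_distance(p, q):
    """Return 1 minus the cosine similarity of two sparse profiles."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


def nearest_author(unknown_text, known_texts, n=4):
    """Return (author, distance) for the known text closest to the unknown one.

    known_texts maps an author name to a list of that author's texts.
    """
    unknown = profile(unknown_text, n)
    best = None
    for author, texts in known_texts.items():
        for text in texts:
            d = cosine_distance(unknown, profile(text, n))
            if best is None or d < best[1]:
                best = (author, d)
    return best


# Invented placeholder sentences, for illustration only.
toy_known = {
    "Author A": ["the jungle drums sounded in the deep night"],
    "Author B": ["quarterly profits rose sharply amid strong demand"],
}
print(nearest_author("the drums sounded again in the jungle night", toy_known))
```

With real novels, each profile would be built from the full text of an issue rather than a single sentence.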

The result was that the most likely author of #2 is Basil Copper, of #10 Frank S. Shawn and of #15 Lee Falk.

Using R

The novels were put into one corpus folder. Two analyses were done: the first with 0-1000 MFW 2-grams and the second with 0-1000 MFC 3-grams, both using the Bootstrap Consensus Tree.
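To make the MFW step concrete, the Python sketch below builds a table of relative frequencies for the most frequent word 2-grams in a corpus at several cut-offs. It only illustrates the feature selection. It does not build a consensus tree, and it is not the R code used here. The function names, the toy corpus and the cut-off values are assumptions for the example.

```python
import re
from collections import Counter


def word_bigrams(text):
    """Return overlapping word 2-grams after lowercasing and dropping punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]


def mfw_table(corpus, n_top):
    """Return (features, table) for the n_top most frequent word 2-grams.

    corpus maps a text name to its text. table[name] lists the relative
    frequency of each selected feature in that text.
    """
    counts = {name: Counter(word_bigrams(text)) for name, text in corpus.items()}
    total = Counter()
    for c in counts.values():
        total.update(c)
    # Sort by descending corpus count, then alphabetically so ties are stable.
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    features = [gram for gram, _ in ranked[:n_top]]
    table = {}
    for name, c in counts.items():
        size = sum(c.values()) or 1
        table[name] = [c[gram] / size for gram in features]
    return features, table


# Invented toy corpus, for illustration only.
toy_corpus = {
    "text_a": "the phantom said the phantom ran",
    "text_b": "the phantom said nothing at all",
}
for n_top in (1, 2, 3):  # illustrative cut-offs, far smaller than 1000
    features, table = mfw_table(toy_corpus, n_top)
    print(n_top, features)
```

Repeating the clustering at each cut-off and keeping only the groupings that recur is the general idea a consensus tree builds on.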

The results group the novels according to the table above, confirming Lee Falk's correction in the "Author's note".

Interestingly, the analysis groups issues 13 and 14 together as statistically similar in style.