Analyzing Discourse and Text Complexity for Learning and Collaborating


Author: Mihai Dascalu

Notes

  • …informational level, coherence is most frequently accounted for by: lexical chains (Morris and Hirst 1991; Barzilay and Elhadad 1997; Lapata and Barzilay 2005) (see 4.3.1 Semantic Distances and Lexical Chains), centering theory (Miltsakaki and Kukich 2000; Grosz et al. 1995) (see 4.2 Discourse Analysis and the Polyphonic Model), in which coherence is established via center continuation, or Latent Semantic Analysis (Foltz et al. 1993, 1998) (see 4.3.2 Semantic Similarity through Tagged LSA), used for measuring the cosine similarity between adjacent phrases.
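That LSA-based measure can be illustrated in a few lines. This is only a rough sketch, not the book's pipeline: a TF-IDF matrix reduced with truncated SVD stands in for a trained LSA space, and the toy sentences, library choice (scikit-learn) and two-component setting are my assumptions; coherence is approximated as the cosine similarity between adjacent sentence vectors.

```python
# Toy stand-in for an LSA space: TF-IDF followed by a rank reduction.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The cell membrane controls what enters the cell.",
    "Transport proteins in the membrane move molecules across it.",
    "Napoleon was exiled to Elba in 1814.",
]

tfidf = TfidfVectorizer().fit_transform(sentences)
vectors = TruncatedSVD(n_components=2).fit_transform(tfidf)

# Coherence proxy: cosine similarity between each pair of adjacent sentences.
for i in range(len(sentences) - 1):
    sim = cosine_similarity(vectors[i:i + 1], vectors[i + 1:i + 2])[0, 0]
    print(f"sim(sentence {i}, sentence {i + 1}) = {sim:.2f}")
```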
  • Among chat voices there are sequential and transversal relations, highlighting a specific point of view in a contrapuntal manner, as mentioned in previous work (Trausan-Matu and Rebedea 2009).
  • From a computational perspective, until recently, the goals of discourse analysis in existing approaches oriented towards conversation analysis were to detect topics and links (Adams and Martell 2008), dialog acts (Kontostathis et al. 2009), lexical chains (Dong 2006) or other complex relations (Rose et al. 2008) (see 3.1.3 CSCL Computational Approaches). The polyphonic model takes full advantage of term frequency – inverse document frequency (TF-IDF) (Adams and Martell 2008; Schmidt and Stone), Latent Semantic Analysis (Schmidt and Stone; Dong 2006), Social Network Analysis (Dong 2006), Machine Learning (e.g., Naïve Bayes (Kontostathis et al. 2009), Support Vector Machines and Collins' perceptron (Joshi and Rose 2007), the TagHelper environment (Rose et al. 2008)), and the semantic distances from the lexicalized ontology WordNet (Adams and Martell 2008; Dong 2006). The model starts from identifying words and patterns in utterances that are indicators of cohesion among them and afterwards performs an analysis based on the resulting graph, similar to some extent to a social network, and on threads and their interactions.
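As a rough illustration of that first step only, the sketch below scores cohesion between utterances with plain TF-IDF cosine similarity and keeps the strong links as edges of an utterance graph. It is not the polyphonic model itself; the toy chat and the threshold value are assumptions.

```python
# TF-IDF cosine similarity between utterances; links above a threshold form
# a rough "cohesion graph" over the conversation (illustration only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

utterances = [
    "I think inheritance is the key OOP concept here.",
    "Inheritance lets a subclass reuse the parent's behaviour.",
    "Does anyone know when the assignment is due?",
    "Polymorphism builds on inheritance to vary that behaviour.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(utterances)
sims = cosine_similarity(tfidf)

THRESHOLD = 0.2  # assumed cut-off for keeping a link as "cohesive"
edges = [(i, j, round(float(sims[i, j]), 2))
         for i in range(len(utterances))
         for j in range(i + 1, len(utterances))
         if sims[i, j] > THRESHOLD]
print(edges)  # candidate cohesion links between utterance indices
```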
  • Semantic Distances and Lexical Chains: an ontology consists of a set of concepts specific to a domain and of the relations between pairs of concepts. Starting from this representation of a domain, we can define various distance metrics between concepts based on the relationships defined among them and later extract lexical chains specific to a given text, consisting of related/cohesive concepts that span a text fragment or the entire document.
    • Lexicalized Ontologies and Semantic Distances: One of the most commonly used resources for English sense relations in terms of lexicalized ontologies is the WordNet lexical database (Fellbaum 1998; Miller 1995, 2010), which consists of three separate databases: one for nouns, one for verbs, and a third for adjectives and adverbs. WordNet groups words into sets of cognitively related words (synsets), thus describing a network of meaningfully inter-linked words and concepts.
    • Nevertheless, we must also present the limitations of WordNet and of semantic distances, which impact the development of subsequent systems (see 6 PolyCAFe – Polyphonic Conversation Analysis and Feedback and 7 ReaderBench (I) – Cohesion-based Discourse Analysis and Dialogism): 1/ the focus only on common words, without covering any special domain vocabularies; 2/ reduced extensibility, as the serialized model makes it difficult to add new domain-specific concepts or relationships.
    • Building the Disambiguation Graph: Lexical chaining derives from textual cohesion (Halliday and Hasan 1976) and involves the selection of related lexical items in a given text (e.g., starting from Figure 8, the following lexical chain could be generated if all words occur in the initial text fragment: "cheater, person, cause, cheat, deceiver, …"). In other words, the lexical cohesive structure of a text can be represented as lexical chains consisting of sequences of words tied together by semantic relationships, which can span the entire text or a subsection of it. (Ontology-based chaining formulas on page 63)
    • The types of semantic relations taken into consideration when linking two words are hypernymy, hyponymy, synonymy, antonymy, or whether the words are siblings by sharing a common hypernym. The weights associated with each relation vary according to the strength of the relation and the proximity of the two words in the text analyzed.
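A hedged sketch of these linking criteria, using NLTK's WordNet interface: synonymy, direct hyper-/hyponymy and shared hypernyms are checked roughly as described above (antonymy is omitted for brevity), but the weights are illustrative assumptions rather than the book's formulas, which are on page 63.

```python
# Requires the NLTK WordNet data: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def relation_weight(word_a: str, word_b: str) -> float:
    """Rough strength of the strongest WordNet relation between two nouns."""
    for syn_a in wn.synsets(word_a, pos=wn.NOUN):
        for syn_b in wn.synsets(word_b, pos=wn.NOUN):
            if syn_a == syn_b:                                   # synonymy
                return 1.0
            if syn_b in syn_a.hypernyms() or syn_a in syn_b.hypernyms():
                return 0.8                                       # hyper-/hyponymy
            if set(syn_a.hypernyms()) & set(syn_b.hypernyms()):
                return 0.5                                       # siblings
    return 0.0                                                   # no link found

print(relation_weight("cheater", "deceiver"))  # same synset -> 1.0
print(relation_weight("cheater", "carrot"))    # no close relation -> 0.0
```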
  • Semantic Similarity through Tagged LSA: Latent Semantic Analysis (LSA) (Deerwester et al. 1989, 1990; Dumais 2004; Landauer and Dumais 1997) is a natural language processing technique that starts from a vector-space representation of semantics highlighting the co-occurrence relations between terms and their containing documents, and then projects the terms onto sets of concepts (semantic spaces) related to the initial texts. LSA builds the vector-space model, later also used for evaluating similarity between terms and documents, which are now indirectly linked through concepts (Landauer et al. 1998a; Manning and Schütze 1999). Moreover, LSA can be considered a mathematical method for representing the meaning of words and passages by analyzing, in an unsupervised manner, a representative corpus of natural language texts.
    • In terms of document size, semantically and topically coherent passages of approximately 50 to 100 words are the optimal units to be taken into consideration while building the initial matrix (Landauer and Dumais 2011); a tiny chunking sketch follows the note below.
      • This fits nicely with post size. Also a good design consideration for JuryRoom.
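A tiny sketch of that sizing rule, splitting a text into passages of roughly 75 words (an assumed target inside the 50–100 word range quoted above); it chunks purely by length and ignores the topical-coherence requirement.

```python
# Chunk purely by length; a real pipeline would also respect topical breaks.
def split_into_passages(text: str, target_words: int = 75) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + target_words])
            for i in range(0, len(words), target_words)]
```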
    • Therefore, as a compromise among all the previous NLP-specific treatments, the latest version of the implemented tagged LSA model (Dascalu et al. 2013a, 2013b) uses lemmas plus their corresponding part-of-speech, after initial input cleaning and stop-word elimination.
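A minimal sketch of that preprocessing step, assuming NLTK's tokenizer, tagger and lemmatizer as stand-ins for the book's actual pipeline: stop words are removed and each remaining token is replaced by lemma plus part-of-speech before the LSA matrix is built.

```python
# Requires NLTK data: punkt, stopwords, wordnet, averaged_perceptron_tagger.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words("english"))

def tagged_tokens(text: str) -> list[str]:
    """Return 'lemma_POS' tokens after cleaning and stop-word removal."""
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    tokens = [t for t in tokens if t not in stop]
    return [f"{lemmatizer.lemmatize(t)}_{pos}" for t, pos in nltk.pos_tag(tokens)]

# Tokens such as 'student_NNS' (exact tags depend on the tagger) would then
# feed the LSA term-document matrix instead of raw word forms.
print(tagged_tokens("The students were reading the assigned chapters."))
```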
  • Topic Relatedness through Latent Dirichlet Allocation
    • Starting from the presumption that documents integrate multiple topics, each document can now be considered a random mixture of corpus-wide topics. In order to avoid confusion, an important aspect needs to be addressed: topics within LDA are latent classes in which every word has a given probability, whereas topics identified within the subsequently developed systems (A.S.A.P., Ch.A.M.P., PolyCAFe and ReaderBench) are key concepts from the text. Additionally, similar to LSA, LDA also uses the implicit bag-of-words assumption that the order of words does not matter when extracting key concepts and concept similarities through co-occurrences within a large corpus.
    • Every topic contains a probability for every word, but after the inference phase a remarkable demarcation can be observed between the salient or dominant concepts of a topic and all other vocabulary words. In other words, the goal of LDA is to reflect the thematic structure of a document or of a collection through hidden variables and to infer this hidden structure by using a posterior inference model (Blei et al. 2003).
    • There are inevitably estimation errors, which are more notable when addressing smaller documents or texts with a wider spread of concepts, as the mixture of topics becomes more uncertain.
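A minimal sketch of the document-as-mixture view (and of how noisy such estimates get on tiny documents), using scikit-learn's LatentDirichletAllocation on a toy corpus; the corpus and the two-topic setting are assumptions, not the book's configuration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "genes dna protein cell biology",
    "dna sequencing genome gene expression",
    "stock market trading shares investors",
    "investors market prices shares economy",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic mixtures
print(doc_topics.round(2))               # each row sums to 1

# Salient words per latent topic (the demarcation noted above).
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [vocab[i] for i in weights.argsort()[-3:][::-1]]
    print(f"topic {k}: {top}")
```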