Wolfram Language

Aggregate Sentence Structures from a Large Corpus

With the Wolfram Language, it is possible to analyze large datasets with ease. This example uses ExtendedEntityClass to extract and investigate the grammatical structure of over one million sentences from the posts on the website english.stackexchange.com.

Import an EntityStore created from english.stackexchange.com.

Register the store for use in EntityValue.

For posts classified with the "single-word-requests" tag, find the 50 most commonly quoted, italicized, bolded or linked words and make a word cloud of the results.

You can investigate the site on a wider scale by examining sentence structures used in posts. Start by extending the post entity type with a property to extract simple sentences.
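
One way such an extension could look is sketched below on a toy in-memory store. The type name "Post", the properties "Body" and "Sentences", and the two sample posts are hypothetical stand-ins, not the contents of the real store; the sketch assumes a computed property defined through "DefaultFunction" and sentence splitting with TextSentences.

```wolfram
(* Toy store standing in for the real one; "Post", "Body" and "Sentences"
   are hypothetical names chosen for this sketch. *)
store = EntityStore[
   "Post" -> <|
     "Entities" -> <|
       "p1" -> <|"Body" -> "Is this a word? I think it is."|>,
       "p2" -> <|"Body" -> "Use a thesaurus. It helps a lot."|>|>,
     "Properties" -> <|
       (* Computed property: derived on demand from the stored "Body" text. *)
       "Sentences" -> <|
         "DefaultFunction" -> Function[post, TextSentences[post["Body"]]]|>|>|>];

EntityRegister[store];

(* Collect the sentences of every post into one flat list. *)
sentences = Flatten[EntityValue["Post", "Sentences"]]
```

Defining the property on the type keeps the extraction logic next to the data, so the same EntityValue call works unchanged for two posts or for the full corpus.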

Use the new property to extract over one million sentences from the posts.

Find the words in each sentence by splitting on whitespace or punctuation.
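
A minimal sketch of this step, using two made-up sentences in place of the corpus. The pattern treats any run of whitespace or punctuation characters as a separator, so contractions such as "don't" are split at the apostrophe.

```wolfram
(* Made-up sample sentences; the real input is the list extracted from the posts. *)
sentences = {"Is this a word, or a phrase?", "I need one word for this."};

(* Split each sentence on runs of whitespace or punctuation. *)
words = StringSplit[sentences, (WhitespaceCharacter | PunctuationCharacter) ..];

(* Number of words in each sentence. *)
wordCounts = Length /@ words
```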

According to a journal article, the number of words per sentence in written prose was conjectured to follow a log-normal distribution. Use FindDistributionParameters to fit a log-normal distribution to the word counts of the sentences in the corpus, then plot the fitted distribution together with the data for comparison.
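
The fit can be sketched as follows. The data here is synthetic, drawn from a log-normal distribution with arbitrarily chosen parameters, so it only illustrates the mechanics and says nothing about the actual corpus; the symbol names mu and sigma are likewise choices made for this sketch.

```wolfram
SeedRandom[1];

(* Synthetic words-per-sentence counts; the parameters 2.8 and 0.6 are arbitrary. *)
counts = Ceiling[RandomVariate[LogNormalDistribution[2.8, 0.6], 5000]];

(* Estimate the parameters of a log-normal distribution from the counts. *)
params = FindDistributionParameters[counts, LogNormalDistribution[mu, sigma]];
fitted = LogNormalDistribution[mu, sigma] /. params;

(* Overlay the fitted density on a histogram of the data. *)
Show[
 Histogram[counts, {1}, "PDF"],
 Plot[PDF[fitted, x], {x, 0, 80}, PlotStyle -> Red, PlotRange -> All]]
```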

Find how often each individual word occurs.

Use DeleteStopwords to clean up the dataset.
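
As a small illustration of the counting and cleanup steps, the sketch below uses a short made-up word list. Lowercasing before counting is an assumption of this sketch, made so that capitalized and uncapitalized forms are tallied together.

```wolfram
(* Made-up word list standing in for the flattened words of the corpus. *)
words = {"The", "word", "is", "a", "synonym", "of", "the", "other", "word"};

(* Count every word, most frequent first. *)
wordCounts = ReverseSort[Counts[ToLowerCase[words]]];

(* Remove stopwords such as "the" and "of" before counting. *)
cleaned = ReverseSort[Counts[DeleteStopwords[ToLowerCase[words]]]]
```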

Visualize the cleaned-up word counts in a log-log plot.

Focus on the top 50 words, using Callout to see the individual words.
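
Both plots can be sketched from an association of word counts. The counts below are invented for illustration only; the axis labels are also a choice of this sketch.

```wolfram
(* Invented counts; the real association comes from the cleaned-up corpus. *)
wordCounts = <|"word" -> 120, "phrase" -> 45, "term" -> 30,
   "meaning" -> 18, "synonym" -> 7, "idiom" -> 3|>;

sorted = ReverseSort[wordCounts];

(* Count against rank, both on logarithmic scales. *)
ListLogLogPlot[Values[sorted], AxesLabel -> {"rank", "count"}]

(* Label each of the top words (at most 50) with a callout. *)
top = Take[sorted, UpTo[50]];
ListLogLogPlot[KeyValueMap[Callout[#2, #1] &, top],
 AxesLabel -> {"rank", "count"}]
```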

Analyze all of the sentences in the corpus with TextStructure, appending results to a file as they are finished. Note that this process takes a very long time and may evaluate for multiple days.

Read in the data from the file.
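
The append-as-you-go pattern could look like the sketch below, which assumes PutAppend for writing and ReadList for reading back. The file name and the two sentences are hypothetical. Writing each result as soon as it is computed means an interrupted run loses only the sentence in progress.

```wolfram
(* Hypothetical sample sentences and output file. *)
sentences = {"The cat sat on the mat.", "I need a word for this."};
file = FileNameJoin[{$TemporaryDirectory, "text-structures.wl"}];

(* Start from an empty file for this demonstration. *)
If[FileExistsQ[file], DeleteFile[file]];

(* Append one "sentence -> structure" rule per sentence as it is finished. *)
Scan[PutAppend[# -> TextStructure[#], file] &, sentences];

(* Read every stored expression back into a list. *)
results = ReadList[file];
Length[results]
```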

Look at a specific example.

Build a function to extract the core structure of a sentence.

Extract the core structure of all of the sentences.

Find all grammatical units in the data and how often they appear.

Find transition counts for each consecutive pair of units.

Here is the number of transitions between nouns and prepositions.
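
The counting steps above can be sketched on a few hand-written structures. The unit sequences below are hypothetical stand-ins for the extracted core structures, each represented as a list of unit names.

```wolfram
(* Hypothetical core structures, one list of grammatical units per sentence. *)
structures = {
   {"Determiner", "Noun", "Verb", "Preposition", "Determiner", "Noun"},
   {"Pronoun", "Verb", "Determiner", "Noun", "Preposition", "Noun"},
   {"Noun", "Verb", "Adjective"}};

(* How often each unit appears, most frequent first. *)
unitCounts = ReverseSort[Counts[Flatten[structures]]];

(* Count every consecutive pair of units within each sentence. *)
transitions = Counts[Catenate[Partition[#, 2, 1] & /@ structures]];

(* Key[...] is needed because the keys are themselves lists. *)
nounToPreposition = Lookup[transitions, Key[{"Noun", "Preposition"}], 0];
prepositionToNoun = Lookup[transitions, Key[{"Preposition", "Noun"}], 0];
{nounToPreposition, prepositionToNoun}
```

Pairs are formed within each sentence only, so the last unit of one sentence is never paired with the first unit of the next.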

Visualize how frequently each transition occurs with MatrixPlot.
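
A possible way to build the matrix for the plot, again on hypothetical structures. Rows correspond to the first unit of a pair and columns to the second; the alphabetical ordering of units is simply what Union produces and is not necessarily the ordering used in the original plot.

```wolfram
(* Hypothetical core structures, one list of grammatical units per sentence. *)
structures = {
   {"Determiner", "Noun", "Verb", "Preposition", "Determiner", "Noun"},
   {"Pronoun", "Verb", "Determiner", "Noun", "Preposition", "Noun"},
   {"Noun", "Verb", "Adjective"}};

transitions = Counts[Catenate[Partition[#, 2, 1] & /@ structures]];
units = Union[Flatten[structures]];

(* Entry (i, j) is the number of transitions from unit i to unit j. *)
matrix = Table[
   Lookup[transitions, Key[{from, to}], 0], {from, units}, {to, units}];

(* Label rows on the left and columns on the top with the unit names. *)
ticks = Transpose[{Range[Length[units]], units}];
MatrixPlot[matrix, FrameTicks -> {{ticks, None}, {None, ticks}}]
```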

Group sentences with the same structure.

Visualize the most common sentence structures in a plot.
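
Grouping and plotting can be sketched with GroupBy and BarChart. The sentences and their structures below are invented; the chart type is a choice of this sketch and may differ from the plot in the original example.

```wolfram
(* Invented "sentence -> core structure" pairs. *)
data = {
   "The cat sat." -> {"Determiner", "Noun", "Verb"},
   "A dog barked." -> {"Determiner", "Noun", "Verb"},
   "She sings." -> {"Pronoun", "Verb"},
   "The bird flew." -> {"Determiner", "Noun", "Verb"},
   "He runs." -> {"Pronoun", "Verb"},
   "Dogs bark loudly." -> {"Noun", "Verb", "Adverb"}};

(* Key: structure; value: list of sentences with that structure. *)
groups = GroupBy[data, Last -> First];

(* Number of sentences per structure, most common first. *)
frequencies = ReverseSort[Length /@ groups];

BarChart[Values[frequencies],
 ChartLabels -> (StringRiffle[#, " "] & /@ Keys[frequencies]),
 BarOrigin -> Left]
```

Keeping the sentences in groups, rather than only their counts, makes it easy to pull example sentences for any structure afterward.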

Look at example sentences for a few interesting structures.

Create a network of some of the most common sentence structures, connecting two structures if they share a parent-child relationship through the insertion of one grammatical unit.
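
The parent-child test can be sketched as "deleting exactly one unit from the child gives the parent". The helper names childQ and label and the five structures are hypothetical; the original example may define the relationship or style the graph differently.

```wolfram
(* Hypothetical structures standing in for the most common ones. *)
structures = {
   {"Noun", "Verb"},
   {"Determiner", "Noun", "Verb"},
   {"Noun", "Verb", "Adverb"},
   {"Determiner", "Adjective", "Noun", "Verb"},
   {"Pronoun", "Verb"}};

(* True when removing one unit from child yields parent. *)
childQ[parent_List, child_List] :=
  Length[child] == Length[parent] + 1 &&
   AnyTrue[Range[Length[child]], Delete[child, #] === parent &];

(* Use a readable string as the vertex name for each structure. *)
label[s_List] := StringRiffle[s, " "];

(* One directed edge from each parent to each of its children. *)
edges = DirectedEdge[label[#1], label[#2]] & @@@
   Select[Tuples[structures, 2], childQ @@ # &];

Graph[label /@ structures, edges, VertexLabels -> "Name"]
```

Testing every ordered pair is quadratic in the number of structures, which is acceptable for a small set of common structures but would need a smarter lookup for the full corpus.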
