Statistics, Science, Random Ramblings

A blog mostly about data and R

Splitting text into parts of unequal length in R

Posted at — Jul 25, 2019

This is the first post based on “Oh, I learned something. I should probably write it down.”

As part of a side project I am handling large amounts of text, which is initially stored as a character vector in which each element is a paragraph. What I would like to do is split that vector into a list based on pre-defined section names. The main problem here is that the sections are of unequal length. Fortunately, this can be solved rather elegantly.

First, build some example text using the first stanza of Edgar Allan Poe’s poem A Dream Within a Dream, with some section headings inserted at arbitrary positions. (The full poem is available on Wikisource.)

my_text <- c(
    "Part 1",
    "Take this kiss upon the brow",
    "And, in parting from you now,",
    "Thus much let me avow -",
    "Part 2",
    "You are not wrong, who deem",
    "That my days have been a dream;",
    "Part 3",
    "Yet if hope has flown away",
    "In a night, or in a day,",
    "In a vision, or in none,",
    "Is it therefore the less gone?",
    "All that we see or seem",
    "Part 4",
    "Is but a dream within a dream."
)

It is obvious that the parts are of unequal length. Now, with a text this short you could handle this manually pretty easily, but if you have several hundred texts that are much longer you need to code a solution.

Fortunately this is easy. First, find the positions of the section headings with grep:

split_idx <- grep("Part", my_text)
split_idx
## [1]  1  5  8 14

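Note that this example is pretty simple. Depending on what you are looking for in your text, you might need a more specific pattern than a plain substring match, because grep("Part", my_text) also returns the position of any ordinary line that happens to contain "Part".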
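One possible refinement, sketched here under assumptions of my own, is to anchor the pattern so that it only matches lines that consist of a heading and nothing else. The vector example_text and the pattern "^Part [0-9]+$" are hypothetical and only serve to illustrate the idea; your headings may need a different regular expression.

```r
# Hypothetical example: the second element is ordinary text that starts with "Part"
example_text <- c(
    "Part 1",
    "Part of this line is ordinary text",
    "Part 2",
    "Another ordinary line"
)

# A plain substring match also flags the ordinary line
loose_idx <- grep("Part", example_text)
loose_idx
## [1] 1 2 3

# Anchoring with ^ and $ matches only lines that are exactly "Part <number>"
strict_idx <- grep("^Part [0-9]+$", example_text)
strict_idx
## [1] 1 3
```

If the headings are fixed strings rather than patterns, comparing with %in% against a vector of known section names would probably be the safer choice.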

Now we can split:

# Label each line with its section number; diff() yields the section lengths
split_vector <- rep(seq_along(split_idx),
                    times = diff(c(split_idx, length(my_text) + 1)))
poem_list <- split(my_text, split_vector)
poem_list
## $`1`
## [1] "Part 1"                        "Take this kiss upon the brow" 
## [3] "And, in parting from you now," "Thus much let me avow -"      
## 
## $`2`
## [1] "Part 2"                          "You are not wrong, who deem"    
## [3] "That my days have been a dream;"
## 
## $`3`
## [1] "Part 3"                         "Yet if hope has flown away"    
## [3] "In a night, or in a day,"       "In a vision, or in none,"      
## [5] "Is it therefore the less gone?" "All that we see or seem"       
## 
## $`4`
## [1] "Part 4"                         "Is but a dream within a dream."

This is the desired result.

What has happened here? With the call to rep we created a vector that numbers the parts, with each number repeated as many times as the corresponding part has lines:

split_vector
##  [1] 1 1 1 1 2 2 2 3 3 3 3 3 3 4 4
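To see where the repetition counts come from, it may help to look at the intermediate steps one at a time. The sketch below hard-codes the heading positions and the length of the example text; the names n_lines, boundaries and section_lengths are introduced here for illustration only.

```r
# Heading positions found by grep() in the example above
split_idx <- c(1, 5, 8, 14)
n_lines <- 15  # length(my_text) in the example

# Append one position past the end so the last section gets a length too
boundaries <- c(split_idx, n_lines + 1)
boundaries
## [1]  1  5  8 14 16

# Differences between consecutive boundaries are the section lengths
section_lengths <- diff(boundaries)
section_lengths
## [1] 4 3 6 2

# Repeat each section number as many times as the section is long
rep(seq_along(split_idx), times = section_lengths)
##  [1] 1 1 1 1 2 2 2 3 3 3 3 3 3 4 4
```

The section lengths add up to 15, so split_vector carries exactly one section number per element of my_text.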

This vector can then be used with the split function in base R, which splits a vector into a list based on a factor (or something that can be converted into a factor with as.factor).
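If you have several hundred texts, it is probably worth wrapping the steps in a function. The following is a minimal sketch of one way to do that, not the code from my project: split_by_heading is a hypothetical helper, and it builds the grouping vector with cumsum on a logical vector instead of rep and diff. It also uses the headings as list names and drops them from the sections, which is a design choice of this sketch. Lines before the first heading are discarded.

```r
# Hypothetical helper: split a character vector at lines matching `pattern`
split_by_heading <- function(text, pattern) {
    is_heading <- grepl(pattern, text)
    if (!any(is_heading)) {
        stop("no heading matches the pattern")
    }
    # Running count of headings seen so far; 0 marks lines before the first heading
    section_id <- cumsum(is_heading)
    keep <- section_id > 0 & !is_heading
    # Explicit factor levels keep sections without body lines as empty entries
    groups <- factor(section_id[keep], levels = seq_len(sum(is_heading)))
    sections <- split(text[keep], groups)
    names(sections) <- text[is_heading]
    sections
}

# Two short made-up texts built from lines of the poem
texts <- list(
    first = c("Part 1", "Take this kiss upon the brow",
              "Part 2", "You are not wrong, who deem"),
    second = c("Part 1", "Yet if hope has flown away", "In a night, or in a day,")
)

result <- lapply(texts, split_by_heading, pattern = "^Part [0-9]+$")
result$first[["Part 2"]]
## [1] "You are not wrong, who deem"
```

The cumsum approach works because every heading line increases the running count by one, so all lines up to the next heading share the same number.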