
PSA: This blog will cover other stuff besides music, I promise… this post is the fruit of a recent meander into the new ggalt, hrbrthemes and tidytext packages.[1]

Intro

Kanye West is responsible for some of the realest, funniest (“if you fall on the concrete thats your ass fault”…asphalt…anyone?) and most inspiring lyrics in music. The strength of these words has had a profound effect on me and my generation. Can the application of sentiment analysis techniques help us understand more about his art?

I’m careful to state that this is not a sentiment analysis of the music per se: there is nothing in here about keys, tempo, melody or any other components of music that influence the emotional valence of a piece.[2] The analysis reflects the lyrical content only, and the sample is limited to Ye’s solo studio albums (album editions taken from Last.fm here). The analysis also includes lyrics from artists featured on the records, so it is representative of the body of work rather than the individual.



The Data

See the footnotes for links to full code and more information on data extraction, using the Genius and Last.fm APIs. Essentially, this is the tidied-up output of Kanye album lyrics:

## Observations: 106
## Variables: 7
## $ name     <chr> "intro", "we dont care", "graduation day", "all falls...
## $ url      <chr> "https://www.last.fm/music/Kanye+West/_/Intro", "http...
## $ duration <dbl> 19, 239, 82, 223, 69, 324, 193, 324, 289, 46, 322, 31...
## $ artist   <chr> "Kanye West", "Kanye West", "Kanye West", "Kanye West...
## $ album    <fctr> The College Dropout, The College Dropout, The Colleg...
## $ track_no <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ lyrics   <chr> " kanye, can i talk to you for a minute? me and the o...
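
The snippets further down refer to a one-word-per-row data frame, tidy_df, whose construction isn’t shown in this post. Here is a minimal sketch of how a table like that could be built with tidytext’s unnest_tokens(). The toy_df data frame and its lyrics are invented placeholders, not the real data, and the real tidy_df may have been built differently:

```r
library(dplyr)
library(tidytext)

# Placeholder data: column names mirror the glimpse above, the content is invented
toy_df <- tibble(
  name   = c("track a", "track b"),
  album  = c("Album One", "Album One"),
  lyrics = c("hello hello world", "one more time")
)

# Split the lyrics column into one lowercase word per row;
# the other columns (name, album) are carried along for each word
toy_tidy_df <- toy_df %>% unnest_tokens(word, lyrics)

# Word count per track
toy_word_counts <- toy_tidy_df %>% count(name)
```

Once the lyrics are in this shape, counting words by album or by track is a single count() call, which is what the later snippets rely on.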


First look

To begin with, a simple look at Ye’s lyric count by album:

The College Dropout is clearly the album with the biggest (lyrical) bang for your buck, registering 13,679 words. 808s & Heartbreak and Yeezus lag the farthest behind (below five thousand apiece, around one-third of College Dropout’s total), and with good reason - 808s is memorably less lyrically dense than the others, while Yeezus has the smallest number of tracks.

We can normalise by track length and give a measure of ‘lyrical density’ (words-per-second) across time:

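library(dplyr)       # %>%, count(), group_by(), summarise(), left_join(), mutate()
library(ggplot2)     # ggplot(), aes(), labs()
library(forcats)     # fct_rev()
library(ggalt)       # geom_lollipop()
library(hrbrthemes)  # theme_ipsum()

# Word count per album (assumes tidy_df holds one row per word)
album_df <- tidy_df %>% count(album)

# Total duration in seconds per album, then words per second
album_df_b <- df %>% group_by(album) %>% summarise(duration = sum(duration)) %>% 
    left_join(album_df, by = "album") %>% mutate(count_per_sec = n/duration)

ggplot(album_df_b, aes(x = count_per_sec, y = fct_rev(album))) + geom_lollipop(point.colour = "firebrick", 
    point.size = 2, horizontal = TRUE) + theme_ipsum(grid = "X", base_size = 14, 
    axis_title_size = 14, caption_size = 12, plot_title_size = 18, subtitle_size = 14) + 
    labs(title = "Words-per-second in Kanye West albums", x = NULL, y = NULL, 
        subtitle = "Records in chronological order", caption = "Lyric data from Genius (https://genius.com/)")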

Again, as expected, 808s hangs back in the words-per-second stakes. The Life Of Pablo (TLOP) exhibits a lyrical pace more comparable to Kanye’s earlier albums (College Dropout still stays top). It’s trickier to say for sure from this metric alone whether his flow/delivery is a throwback (it would be better if tempo were incorporated into the equation, and remember that this figure includes breakdowns, outros, etc.) but maybe someone’s been listening…


It’s worth taking a look at how some of this looks across tracks, too.

‘Last Call’ stands lone and tall at 2,745 words, with most tracks in and around the 500-mark. However, ‘Last Call’ is also the longest Kanye album track (12:40(!!)). As before, what happens if we normalise by track length?

Now, ‘Freestyle 4’ off of TLOP is highest - remember, this metric of lyrical density is inclusive of breakdowns and such. If you listen to this track, which is just over two minutes long, you’ll hear there is literally no let-up from the big man.
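
The track-level version of the words-per-second calculation follows the same pattern as the album-level code above. This is a rough sketch with invented numbers (toy_tracks and toy_counts are placeholders), just to show how normalising by duration can reorder tracks:

```r
library(dplyr)

# Invented values for illustration only
toy_tracks <- tibble(
  name     = c("long track", "short track"),
  duration = c(600, 120)       # track length in seconds
)
toy_counts <- tibble(
  name = c("long track", "short track"),
  n    = c(1200L, 480L)        # words per track
)

# Join word counts to durations, divide, and rank by density
toy_density <- toy_tracks %>%
  left_join(toy_counts, by = "name") %>%
  mutate(count_per_sec = n / duration) %>%
  arrange(desc(count_per_sec))
```

In this toy case the longer track has more words in total, but the shorter one comes out on top once the count is divided by its length.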

For good measure, let’s end this section with a quick look at the range of lyric counts by record (hint: excuse for the dumbbell chart):

How similar are Graduation, 808s and Yeezus?

Feelings

We’ve touched on things like the word count and lyrical density of Kanye’s records. Now, let’s start to explore what is actually being said.

Note: from this point on, English ‘stop words’ (e.g. and, or, but) are excluded from the analysis.
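
For anyone wondering what excluding stop words looks like in practice, here is a minimal sketch. It assumes the stop_words table that ships with tidytext was the list used (the post doesn’t say which list), and the lyric line is made up:

```r
library(dplyr)
library(tidytext)

# Invented line, tokenised into one word per row
toy_words <- tibble(lyrics = "love and the love but not the hate") %>%
  unnest_tokens(word, lyrics)

# anti_join() drops every row whose word appears in the stop word list,
# then count() tallies what is left, most frequent first
toy_top_words <- toy_words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```

The same anti_join() step, applied to the full one-word-per-row table, would give the input for a most-common-words chart.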

Kanye’s confirmed it, everyone - love conquers all (even sh*t). We’ll need to dig further to get into the emotion and the message contained in the tracks that embody these words, though.



I tried to quantify the amount of different types of sentiment in Kanye’s lyrics by using tidytext. In this case, I went with the NRC lexicon developed by Saif Mohammad, which associates words with a set of sentiment categories: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. As a Psychology graduate with some experience in qualitative analysis (and as a music lover, first and foremost) I’m mindful that a lexicon is a simplification that will undoubtedly miss important contextual and cultural cues (especially important in understanding art that nods to lots of subcultures), but it can provide us with a useful summary in evaluating the lyrics’ emotional composition.

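# NRC lexicon via get_sentiments(), which replaces the older
# sentiments %>% filter(lexicon == "nrc") idiom
# (recent tidytext versions fetch NRC through the textdata package)
nrc <- get_sentiments("nrc") %>% dplyr::select(word, sentiment)

# Total word count per album (sentiment_df is assumed to hold one row per word)
albums <- sentiment_df %>% group_by(album) %>% mutate(total_words = n()) %>% 
    ungroup() %>% distinct(album, total_words)

# Share of each album's words that fall into each NRC category
by_album_sentiment <- sentiment_df %>% inner_join(nrc, by = "word") %>% count(sentiment, 
    album) %>% ungroup() %>% complete(sentiment, album, fill = list(n = 0)) %>% 
    inner_join(albums, by = "album") %>% group_by(album, sentiment, total_words) %>% summarize(words = sum(n)) %>% 
    mutate(album_prop = words/total_words) %>% ungroup()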

by_album_sentiment %>% filter(!sentiment %in% c("negative", "positive", "trust", 
    "fear")) %>% ggplot(aes(x = album, y = album_prop, group = sentiment, colour = sentiment)) + 
    geom_line(size = 1.5) + theme_ipsum(base_size = 14, axis_title_size = 14, 
    grid = "Y", caption_size = 12, plot_title_size = 18, subtitle_size = 14) + 
    scale_color_ipsum() + scale_y_percent(limits = c(0, 0.1), breaks = seq(0, 
    0.1, by = 0.02)) + labs(title = "Sentiment in Kanye West albums", x = NULL, 
    y = "Sentiment as a % of album lyric count", subtitle = "Powered by the NRC lexicon", 
    caption = "Lyric data from Genius (https://genius.com/)") + theme(axis.text.x = element_text(angle = 45, 
    hjust = 1, size = 10)) + theme(legend.position = "top")

So, what did we learn here? It looks like Graduation’s lyrics have the highest (proportional) rate of anticipation and surprise sentiment, compared to the other records. Interestingly, this is in keeping with the themes captured in Takashi Murakami’s cover art (below), where we see ‘Dropout Bear’ being thrust out into a new world.



My Beautiful Dark Twisted Fantasy has the highest rate of anger and disgust, comparatively. Again, these are prominent in George Condo’s bastardized, surreal vision for the record.



Can we look at this record closer to see where these kinds of feelings are most prominent?

# Bing lexicon (positive/negative labels) via get_sentiments(), which replaces
# the older sentiments %>% filter(lexicon == "bing") idiom
bing <- get_sentiments("bing") %>% dplyr::select(word, sentiment)

# Total word count per track
tracks <- sentiment_df %>% group_by(name, album, track_no) %>% mutate(total_words = n()) %>% 
    ungroup() %>% distinct(name, album, track_no, total_words)

# Net sentiment per track: positive word count minus negative word count
by_track_sentiment <- sentiment_df %>% inner_join(bing, by = "word") %>% count(sentiment, 
    name) %>% ungroup() %>% complete(sentiment, name, fill = list(n = 0)) %>% 
    inner_join(tracks, by = "name") %>% group_by(name, album, track_no, sentiment, total_words) %>% 
    summarize(words = sum(n)) %>% spread(sentiment, words, fill = 0) %>% mutate(sentiment = positive - 
    negative) %>% ungroup()

by_track_sentiment %>% filter(album == "My Beautiful Dark Twisted Fantasy") %>% 
    ggplot(aes(x = sentiment, y = reorder(str_to_title(name), -track_no))) + 
    geom_lollipop(point.colour = "firebrick", point.size = 2, horizontal = TRUE) + 
    theme_ipsum(base_size = 14, axis_title_size = 14, grid = "Y", caption_size = 12, 
        plot_title_size = 18, subtitle_size = 14) + labs(title = "Net sentiment by track in \nMy Beautiful Dark Twisted Fantasy", 
    x = "Net sentiment (positive minus negative words)", y = NULL, subtitle = "Powered by the Bing lexicon", 
    caption = "Lyric data from Genius (https://genius.com/)")

The above uses the Bing lexicon to examine how positive/negative sentiment manifests by track in My Beautiful Dark Twisted Fantasy. ‘Monster’ is furthest to the left, with the most negative net sentiment (positive words minus negative words) in its lyrics, followed closely by ‘So Appalled’. In fact, ‘All of the Lights’ would appear to be the only real respite for ‘positive’ lyrical sentiment.

What if we apply this to the rest of the discography, and see which tracks score highest for positive/negative sentiment across the board?

It might be surprising to see ‘Amazing’ way out in front (n.b. I explain Love Lockdown’s perhaps false position in the conclusion). Remember, this is based on lyrics only - here’s a reminder of the hook…

No matter what, you’ll never take that from me, My reign is as far as your eyes can see; It’s amazing, so amazing, so amazing, so amazing, It’s amazing, so amazing, so amazing, so amazing It’s amazing, so amazing, so amazing, so amazing It’s amazing, so amazing, so amazing, so amazing It’s amazing

Some last words

We were able to answer some novel questions with this analysis. How many words are in Kanye’s albums and tracks, and how are these distributed by track and within-track (lyrical density)? What words are most common, and how can we summarise the emotional content of these lyrics using different categories of sentiment? How does this look at an album and track level?

Of course, we’re missing the full story by just considering words as individual units and the relationship to sentiment in this way. Lyrics’ meaning often comes from their relation to each other. In future, one can try and get at this by exploring the co-occurrence of words, or consecutive sequences of words known as n-grams, and validating sentiment scores by understanding context (e.g. ‘loving’ preceded by ‘not’ in ‘Love Lockdown’ would not have been picked up here, which is why it scored unexpectedly high on ‘positive’ sentiment).
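
As a taste of what that could look like, here is a small sketch of the bigram idea: split the text into consecutive word pairs, then flag sentiment words that directly follow a negation. The lyric line, the one-word toy_lexicon and the negation_words list are all assumptions made up for this example, not part of the analysis above:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Invented line, split into consecutive word pairs (bigrams)
toy_bigrams <- tibble(lyrics = "i am not loving this but i am loving that") %>%
  unnest_tokens(bigram, lyrics, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# Tiny hand-made lexicon standing in for a real one such as Bing
toy_lexicon <- tibble(word = "loving", sentiment = "positive")

# Hypothetical list of negation words to check for
negation_words <- c("not", "no", "never")

# Keep bigrams whose second word carries sentiment,
# and mark the ones where the first word negates it
toy_scored <- toy_bigrams %>%
  inner_join(toy_lexicon, by = c("word2" = "word")) %>%
  mutate(negated = word1 %in% negation_words)
```

Here ‘loving’ appears twice, once after ‘not’ and once after ‘am’, so only the first occurrence is flagged. A fuller version could flip or discount the score of flagged words before summing sentiment per track.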

Finally, a treat for those who know. Thanks for letting me get in my zone.




  1. To keep the post concise I don’t show all of the code, especially code that generates figures. But you can find the full code here.

  2. Some of these factors are captured by Spotify, as explored in this piece (a big inspiration for this post). This data was not available for Kanye’s full studio discography through the Spotify API at the time of this analysis.
