Web Scraping to Find Valid Spotify Genres – Benjamin Frizzell’s Blog

Spotify uses genre tags to categorize music, powering features like song recommendations and playlist generation. Through the Spotify Web API, genres can also be used to search for tracks, artists, playlists, and more.

However, there’s a catch: while Spotify recognizes thousands of genres, it doesn’t provide a straightforward way to access a complete list of them. This becomes a problem if you’re building applications that rely on valid genre inputs. In my case, I’m building a recommender model, and I want users to be able to “pre-populate” the dataset with songs from their favorite genres. That requires an up-to-date, accurate genre list so that queries don’t come back empty.

Fortunately, several people have already compiled these genres and published them online. But given how long the list is, manually copying it would be tedious and error-prone. Instead, I used BeautifulSoup, a simple yet powerful Python library for web scraping, to extract the genres automatically and store them neatly for later use.

import requests
from bs4 import BeautifulSoup
import pickle

We’ll start by sending a GET request to the blog page using the requests package. This will give us the page as a string of raw HTML.

# URL of the blog page
url = "https://www.spudart.org/blog/six-thousand-spotify-genres/"

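# Send a GET request; the timeout keeps the call from hanging indefinitely
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
response.raise_for_status()  # stop early if the page could not be fetched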

# look at some of the text
print(response.text[:500])

<!DOCTYPE html><html lang="en-US"><head><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1"><link rel="profile" href="https://gmpg.org/xfn/11"><meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1' /> <style>img:is([sizes="auto" i], [sizes^="auto," i]) { contain-intrinsic-size: 3000px 1500px }</style> <!-- This site is optimized with the Yoast SEO plugin v25.3.1 - https://yoast.com/wordpress/plugins/seo/ --

The raw HTML is messy and hard to work with, which is why we use BeautifulSoup to clean it up for us.

# Parse HTML
soup = BeautifulSoup(response.text, "html.parser")

# Look at some of the text
print(soup.get_text()[:200])

  6,119 genres on Spotify: try exploring them all - Spudart                                       Skip to content  FacebookBlueskyComicBlogNotepadArt projectsAboutContactEmail newsletter  Search    Se

The text comes out as one long run, so we’ll have get_text put a \n newline character between elements, split on those newlines, and remove leftover blank lines:

# Extract the page text and split it into lines
text = soup.get_text(separator="\n").splitlines()

# remove blank lines
text = [line.strip() for line in text if line.strip()]

# print some of the cleaned text
text[:20]

['6,119 genres on Spotify: try exploring them all - Spudart',
 'Skip to content',
 'Facebook',
 'Bluesky',
 'Comic',
 'Blog',
 'Notepad',
 'Art projects',
 'About',
 'Contact',
 'Email newsletter',
 'Search',
 'Search for:',
 'Main Menu',
 'Search',
 'Search for:',
 'Comic',
 'Blog',
 'Notepad',
 'Art projects']
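
To see the cleanup step in isolation, here is a minimal sketch run on a made-up string that stands in for the output of get_text(separator="\n"). The sample text and the helper name clean_lines are illustrative assumptions, not part of the original notebook.

```python
def clean_lines(raw_text: str) -> list[str]:
    """Split text into lines, trim surrounding whitespace, and drop empty lines."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


# Hypothetical stand-in for soup.get_text(separator="\n")
sample = "  6,119 genres on Spotify \n\n   \nSkip to content\n\tFacebook\n"

print(clean_lines(sample))
# ['6,119 genres on Spotify', 'Skip to content', 'Facebook']
```

Lines that contain only whitespace are dropped as well, because the filter tests the stripped line rather than the raw one.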

Looking at the website in a browser, we know that the list of genres starts after the line: “All 6,119 Spotify genres” and ends before: “Enjoyed this blog post?”, so we’ll get all of the content between these two lines.

# filter for just genres
start = text.index('All 6,119 Spotify genres')+1
end = text.index('Enjoyed this blog post?')
genres = text[start:end]

# should have 6,119 genres according to the blog post
assert len(genres) == 6119

# show some of the genres
print(genres[:10])
print(genres[-10:])

['21st Century Classical', '432hz', '48g', '5th Wave Emo', '8-bit', '8d', 'A3', 'Aarhus Indie', 'Aberdeen Indie', 'Abstract']
['Zither', 'Zohioliin Duu', 'Zolo', 'Zomi Pop', 'Zouglou', 'Zouk', 'Zouk Riddim', 'Zurich Indie', 'Zxc', 'Zydeco']
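
One caveat with this step: the start marker hard-codes the genre count, so text.index will raise a ValueError as soon as the source page updates its total. A possible way to make the slice less brittle is to match the markers by pattern instead. The sketch below assumes the heading keeps the shape "All N Spotify genres"; the helper name slice_between and the tiny demo list are hypothetical.

```python
import re


def slice_between(lines, start_pattern, end_pattern):
    """Return the lines strictly between the first line that fully matches
    start_pattern and the next line that fully matches end_pattern."""
    start = next(
        (i for i, line in enumerate(lines) if re.fullmatch(start_pattern, line)),
        None,
    )
    if start is None:
        raise ValueError(f"start marker not found: {start_pattern!r}")

    end = next(
        (i for i in range(start + 1, len(lines))
         if re.fullmatch(end_pattern, lines[i])),
        None,
    )
    if end is None:
        raise ValueError(f"end marker not found: {end_pattern!r}")

    return lines[start + 1:end]


# Hypothetical miniature of the cleaned page text
demo = [
    "Blog",
    "All 6,119 Spotify genres",
    "21st Century Classical",
    "Zydeco",
    "Enjoyed this blog post?",
    "About",
]

print(slice_between(demo, r"All [\d,]+ Spotify genres", r"Enjoyed this blog post\?"))
# ['21st Century Classical', 'Zydeco']
```

With a pattern-based marker, the fixed assert on 6,119 entries would also need to change, for example to a check that the list is non-empty.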

This looks pretty good! But as a quick sanity check, we’ll make sure all of these genres are actually recognized by Spotify. We’ll do this by querying the API for songs from each genre to see if we get anything back. Before we can do that, we need an access token from Spotify, which uses OAuth 2.0 authorization. I won’t go into the details of how this authorization works; Spotify’s developer documentation on authorization covers them.

import base64
import os

from dotenv import load_dotenv

def get_spotify_access_token(CLIENT_ID: str, CLIENT_SECRET: str) -> str:
    '''
    Retrieve an access token from Spotify using OAuth 2.0 authorization
    (client credentials grant).
    '''
    # Encode "id:secret" in Base64 for the HTTP Basic Authorization header
    auth_str = f"{CLIENT_ID}:{CLIENT_SECRET}"
    b64_auth_str = base64.b64encode(auth_str.encode()).decode()

    headers = {
        "Authorization": f"Basic {b64_auth_str}",
        "Content-Type": "application/x-www-form-urlencoded"
    }

    data = {"grant_type": "client_credentials"}

    response = requests.post(
        "https://accounts.spotify.com/api/token",
        headers=headers, data=data
    )
    # Fail with a clear HTTP error if the credentials are rejected
    response.raise_for_status()

    return response.json()["access_token"]

# Read CLIENT_ID and CLIENT_SECRET from a local .env file
load_dotenv()
TOKEN = get_spotify_access_token(os.getenv("CLIENT_ID"), os.getenv("CLIENT_SECRET"))
BASE_URL = 'https://api.spotify.com/v1/search'
headers = {"Authorization": f"Bearer {TOKEN}"}
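
For readers curious about the first step of the token request, the sketch below isolates how the Basic Authorization header is built. The credentials are placeholders and the helper name basic_auth_header is hypothetical; nothing here contacts Spotify.

```python
import base64


def basic_auth_header(client_id: str, client_secret: str) -> dict[str, str]:
    """Build the HTTP Basic Authorization header from a client ID and secret."""
    credentials = f"{client_id}:{client_secret}".encode()
    return {"Authorization": "Basic " + base64.b64encode(credentials).decode()}


# Placeholder credentials, not real ones
header = basic_auth_header("my-client-id", "my-client-secret")

# Decoding the value recovers the original "id:secret" pair
scheme, encoded = header["Authorization"].split(" ", 1)
print(scheme, base64.b64decode(encoded).decode())
# Basic my-client-id:my-client-secret
```

Because the value decodes straight back to the secret, Base64 here is an encoding rather than encryption, which is one reason to keep the credentials in a .env file and out of the notebook itself.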

Now we can cycle through the list of genres and filter out the unrecognized ones. For each genre in the list, we will:

Send a GET request to Spotify asking for a single track tagged with the genre.
Check if any tracks are returned. If not, we won’t include the genre in our final list.

def is_valid_genre(genre):
    '''
    Determine if a given genre is recognized by Spotify,
    by whether or not it returns tracks for a query.
    '''
    params = {
        "q": f"genre:{genre}",
        "type": "track",
        "limit": 1
    }
    response = requests.get(BASE_URL, params=params, headers=headers)
    # Raise on HTTP errors (e.g. an expired token) instead of failing with a KeyError
    response.raise_for_status()

    # An empty 'items' list means no tracks matched the genre
    return bool(response.json()['tracks']['items'])

# keep only the genres that return at least one track
genres = [genre for genre in genres if is_valid_genre(genre)]
len(genres)
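
Sending several thousand requests in a tight loop may run into Spotify's rate limit, which the Web API signals with a 429 status and a Retry-After header. The original notebook does not show how, or whether, this was handled, so the following is only a sketch of one option. The helper get_with_retry and the FakeResponse class are hypothetical; send is assumed to be any zero-argument callable that returns an object with status_code and headers, such as a lambda wrapping requests.get.

```python
import time


def get_with_retry(send, max_attempts=5, default_wait=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out."""
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Honor the server's Retry-After value (in seconds) when present
            wait = float(response.headers.get("Retry-After", default_wait))
            sleep(wait)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Offline demonstration with fake responses (hypothetical stand-ins)
class FakeResponse:
    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}


queue = [FakeResponse(429, {"Retry-After": "2"}), FakeResponse(200)]
waits = []  # records the requested pauses instead of actually sleeping
result = get_with_retry(lambda: queue.pop(0), sleep=waits.append)
print(result.status_code, waits)
# 200 [2.0]
```

Inside the validation function, the request could then be wrapped as get_with_retry(lambda: requests.get(BASE_URL, params=params, headers=headers)).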

It’s a good thing we checked the list: there were about 120 genres in the original list that didn’t return any tracks, so we threw those out.

We’re done! Now, we can pickle the genre list to use later.

# store genres list
with open('../assets/valid_genres.pkl','wb') as f:
    pickle.dump(genres,f)
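
Reading the list back in a later session mirrors the write step. The sketch below round-trips a small illustrative sample through a temporary directory rather than the ../assets path, so it can run anywhere; the variable names are assumptions.

```python
import pickle
import tempfile
from pathlib import Path

genres_to_store = ["Abstract", "Zouk", "Zydeco"]  # small illustrative sample

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "valid_genres.pkl"

    # Write the list, as in the step above
    with open(path, "wb") as f:
        pickle.dump(genres_to_store, f)

    # Read it back, as a later notebook or script would
    with open(path, "rb") as f:
        loaded_genres = pickle.load(f)

print(loaded_genres == genres_to_store)
# True
```

Since unpickling can execute arbitrary code, pickle.load should only be used on files you created or trust. For a plain list of strings, a JSON file would work just as well.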

Conclusion

In this notebook, I scraped a complete list of valid Spotify genres from an online source using requests and BeautifulSoup.
Because Spotify doesn’t publish its genre catalog directly through the Web API, having an up-to-date list is essential for building tools that rely on valid genre inputs - such as the personalized recommender I’m working on.

By automating the extraction instead of copying manually, we keep the data accurate, reproducible, and easy to refresh if the source ever changes. This approach demonstrates how simple web scraping can be a powerful tool for filling gaps in public APIs.