🌀️ Day 3: NBA Sports Analytics Data Lake Setup


This project showcases how to build an NBA Data Lake on AWS for analyzing basketball data, leveraging three key AWS services to store, process, and query it:

🔹 Amazon S3 – for storing raw and processed data

🔹 AWS Glue – for seamless data cataloging and ETL

🔹 Amazon Athena – for fast, SQL-based querying of the data

The data is pulled from the SportsData.io API, stored in S3, transformed with Glue, and made easily queryable using Athena.

Check out the project on GitHub: NBASportsAnalyticsDataLake


Prerequisites

Before diving into the NBA Sports Analytics Data Lake project, make sure you have the following tools and resources set up. If you're good to go, feel free to jump to the next section.

  1. Python 3.x

    This will be used for the data lake setup and cleanup scripts and for handling the data processing.

    • Install it: Download Python 3.x from python.org, run the installer, and ensure you check "Add Python to PATH" during installation.
  2. pip

    Python's package manager, which you'll use to install the required libraries for the project.

    • Verify it's installed: Open your terminal and run:

        pip --version
      
    • If missing, install it:

        python -m ensurepip --upgrade
      
  3. AWS CLI

    Use the AWS CLI to interact with AWS services directly from your terminal.

    • Install it: Download the AWS CLI from the official AWS CLI page and follow the instructions for your operating system.

    • Verify installation with:

        aws --version
      
  4. VS Code

    A versatile code editor with excellent support for Python and AWS development.

  5. AWS Account & IAM User

    To interact with AWS services (like S3, Glue, and Athena), you'll need an AWS account and an IAM user with necessary permissions.

    • Create an AWS account at aws.amazon.com.

    • Set up an IAM user and grant permissions for the S3, Glue, and Athena services. Make sure to save the Access Key ID and Secret Access Key; you'll plug them into the AWS CLI with aws configure (see the example just after this list).

  6. SportsData.io API Key

    You'll need this key to access NBA data for the analytics.

    • Sign up at SportsData.io and subscribe to the NBA API Free Trial.

    • Copy the provided API key (e.g., xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) and keep it handy for later integration.
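
If you've just created the IAM user from step 5, the quickest way to make its credentials available to the AWS CLI (and to boto3, which the scripts in this guide use) is aws configure. It prompts for the keys you saved; the region shown below is only an example and should match the one you plan to use:

    aws configure
    # AWS Access Key ID [None]: <your Access Key ID>
    # AWS Secret Access Key [None]: <your Secret Access Key>
    # Default region name [None]: eu-west-2
    # Default output format [None]: json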


System Design

Let's dive into each component:

Local Computer:

On the local computer, we will write the two Python scripts used in this project: a setup script that provisions the S3 bucket, Glue database and table, and Athena configuration and loads the raw NBA data, and a cleanup script that deletes those resources when they are no longer needed. These files are developed and tested locally, then run against AWS to deploy the infrastructure and process the data.

AWS S3 Bucket:

Amazon S3 (Simple Storage Service) is a scalable object storage service. In this project, we will use S3 to store both raw and processed data. Raw NBA player data fetched from the SportsData.io API will be stored in S3, and processed data will also be stored in S3 after being transformed by the Glue ETL jobs. S3 serves as the central repository for all data in the data lake.
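
To make this concrete, here is roughly how the bucket ends up being laid out. The raw-data/ and athena-results/ prefixes come from the setup script later in this guide; processed-data/ is only a suggested location for the output of future Glue ETL jobs:

    nba-sports-analytics-data-lake/
    ├── raw-data/
    │   └── nba_player_data.json      # raw player feed from SportsData.io
    ├── athena-results/               # Athena query output
    └── processed-data/               # suggested output prefix for future Glue ETL jobs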

AWS Glue:

AWS Glue is a fully managed ETL (Extract, Transform, Load) service. In this project, Glue will be used to process and transform the raw NBA data stored in S3 into a structured format suitable for analysis. Glue will perform data cleaning, transformation, and enrichment tasks before writing the processed data back into S3 in a queryable format.
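
This guide keeps the processing deliberately small (the setup script registers the raw JSON directly as a Glue table), but to give a feel for what a Glue ETL job could look like, here is a minimal PySpark sketch. It assumes the nba_players table created later in this guide and a processed-data/ output prefix; both are assumptions you would adapt to your own setup:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw players table registered in the Glue Data Catalog
players = glue_context.create_dynamic_frame.from_catalog(
    database="glue_nba_data_lake", table_name="nba_players"
)

# Keep a handful of columns and rename them to snake_case
cleaned = ApplyMapping.apply(
    frame=players,
    mappings=[
        ("PlayerID", "int", "player_id", "int"),
        ("FirstName", "string", "first_name", "string"),
        ("LastName", "string", "last_name", "string"),
        ("Team", "string", "team", "string"),
        ("Position", "string", "position", "string"),
    ],
)

# Write the processed data back to S3 as Parquet (assumed output prefix)
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://nba-sports-analytics-data-lake/processed-data/"},
    format="parquet",
)

job.commit()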

AWS Glue Crawler:

Glue Crawlers automatically scan data in S3 to infer the schema and create a metadata catalog. The crawler will be used to scan the raw and transformed data stored in S3, extract the necessary metadata, and create tables in the AWS Glue Data Catalog. This metadata will be used by Athena for efficient querying.
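
The setup script in this guide actually creates the table definition itself with boto3, so a crawler is not strictly required here. If you would rather let a crawler infer the schema, a minimal boto3 sketch looks like the following; the crawler name is made up for this example and the IAM role ARN is a placeholder for a role with Glue and S3 permissions:

import boto3

glue = boto3.client("glue", region_name="eu-west-2")

# Create a crawler that scans the raw-data prefix and writes tables
# into the glue_nba_data_lake database created by the setup script.
glue.create_crawler(
    Name="nba-raw-data-crawler",
    Role="arn:aws:iam::123456789012:role/YourGlueCrawlerRole",  # placeholder ARN
    DatabaseName="glue_nba_data_lake",
    Targets={"S3Targets": [{"Path": "s3://nba-sports-analytics-data-lake/raw-data/"}]},
)

# Run it once; the inferred table then appears in the Data Catalog.
glue.start_crawler(Name="nba-raw-data-crawler")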

Amazon Athena:

Amazon Athena is a serverless interactive query service that allows you to analyze data directly in S3 using standard SQL. In this project, Athena will be used to run SQL queries against the transformed NBA data stored in S3. The metadata generated by the Glue Crawler will enable Athena to understand the schema of the data, making it easy to query and analyze it without needing to move or load the data into a traditional database.
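
Once the table exists in the Data Catalog, querying it is plain SQL. Here is a small example run through boto3; the database, table, and output location match what the setup script below creates:

import boto3

athena = boto3.client("athena", region_name="eu-west-2")

# Count players per team in the raw nba_players table.
response = athena.start_query_execution(
    QueryString="""
        SELECT Team, COUNT(*) AS player_count
        FROM nba_players
        GROUP BY Team
        ORDER BY player_count DESC
    """,
    QueryExecutionContext={"Database": "glue_nba_data_lake"},
    ResultConfiguration={
        "OutputLocation": "s3://nba-sports-analytics-data-lake/athena-results/"
    },
)
print(response["QueryExecutionId"])

You can poll get_query_execution until the query reaches the SUCCEEDED state and then read the rows with get_query_results, or simply run the same SQL in the Athena console.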


Step-by-Step Guide: Setting Up NBA Sports Analytics Data Lake

Set Up the Project Structure

Create a project directory with the following structure:

NBA_Sports_Analytics_Data_Lake/
├── src/
│   ├── nba_data_lake_setup.py        # Script to set up the data lake
│   ├── delete_aws_resources.py       # Script to delete AWS resources
│   ├── .env                          # Stores sensitive information
│   ├── .gitignore                    # Ignores sensitive files in version control

Create the .env File

Add your SportsData.io API key to the .env file in the src folder:

API_KEY=your_sportsdata_io_api_key
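
Because the .env file holds your API key, it should never be committed. A minimal .gitignore in the src folder only needs a couple of entries:

.env
__pycache__/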

Add the Data Lake Setup Script

Create the nba_data_lake_setup.py file in the src folder. This script sets up the S3 bucket, Glue database, and Athena query configurations.

nba_data_lake_setup.py

import boto3
import json
import time
import requests
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# AWS configurations
region = "eu-west-2"  # Change if necessary
bucket_name = "nba-sports-analytics-data-lake"  # Replace with a unique S3 bucket name
glue_database_name = "glue_nba_data_lake"
athena_output_location = f"s3://{bucket_name}/athena-results/"

# SportsData.io configurations
api_key = os.getenv("API_KEY")
if not api_key:
    raise ValueError("API key not found in the environment variables.")

nba_endpoint = f"https://api.sportsdata.io/v3/nba/scores/json/Players?key={api_key}"

# AWS clients
s3_client = boto3.client("s3", region_name=region)
glue_client = boto3.client("glue", region_name=region)
athena_client = boto3.client("athena", region_name=region)

def create_s3_bucket():
    try:
        if region == "us-east-1":
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            s3_client.create_bucket(
                Bucket=bucket_name,
                CreateBucketConfiguration={"LocationConstraint": region},
            )
        print(f"S3 bucket '{bucket_name}' created successfully.")
    except Exception as e:
        print(f"Error creating S3 bucket: {e}")

def create_glue_database():
    try:
        glue_client.create_database(
            DatabaseInput={
                "Name": glue_database_name,
                "Description": "Glue database for NBA analytics.",
            }
        )
        print(f"Glue database '{glue_database_name}' created successfully.")
    except Exception as e:
        print(f"Error creating Glue database: {e}")

def fetch_nba_data():
    try:
        response = requests.get(nba_endpoint)
        response.raise_for_status()
        print("Fetched NBA data successfully.")
        return response.json()
    except Exception as e:
        print(f"Error fetching NBA data: {e}")
        return []

def upload_to_s3(data):
    try:
        file_key = "raw-data/nba_player_data.json"
        line_delimited_data = "\n".join([json.dumps(record) for record in data])
        s3_client.put_object(Bucket=bucket_name, Key=file_key, Body=line_delimited_data)
        print(f"Uploaded data to S3: {file_key}")
    except Exception as e:
        print(f"Error uploading to S3: {e}")

def create_glue_table():
    try:
        glue_client.create_table(
            DatabaseName=glue_database_name,
            TableInput={
                "Name": "nba_players",
                "StorageDescriptor": {
                    "Columns": [
                        {"Name": "PlayerID", "Type": "int"},
                        {"Name": "FirstName", "Type": "string"},
                        {"Name": "LastName", "Type": "string"},
                        {"Name": "Team", "Type": "string"},
                        {"Name": "Position", "Type": "string"},
                    ],
                    "Location": f"s3://{bucket_name}/raw-data/",
                    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                    "SerdeInfo": {
                        "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
                    },
                },
                "TableType": "EXTERNAL_TABLE",
            },
        )
        print(f"Glue table 'nba_players' created successfully.")
    except Exception as e:
        print(f"Error creating Glue table: {e}")

def configure_athena():
    try:
        athena_client.start_query_execution(
            QueryString="CREATE DATABASE IF NOT EXISTS nba_analytics",
            QueryExecutionContext={"Database": glue_database_name},
            ResultConfiguration={"OutputLocation": athena_output_location},
        )
        print("Athena output location configured successfully.")
    except Exception as e:
        print(f"Error configuring Athena: {e}")

def main():
    print("Starting data lake setup...")
    create_s3_bucket()
    time.sleep(5)
    create_glue_database()
    nba_data = fetch_nba_data()
    if nba_data:
        upload_to_s3(nba_data)
    create_glue_table()
    configure_athena()
    print("Data lake setup complete.")

if __name__ == "__main__":
    main()
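
The script pulls in two third-party packages besides boto3: requests for calling the SportsData.io API and python-dotenv for reading the .env file. Install them before running it:

    pip install boto3 requests python-dotenv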

Once the setup script has been run (Step 6 below), you should see the S3 bucket created in the AWS console, with the raw player data file stored under the raw-data/ prefix.

Step 5: Add the Resource Cleanup Script

Create the delete_aws_resources.py file to clean up the AWS resources:

delete_aws_resources.py

import boto3
from botocore.exceptions import ClientError

bucket_name = "nba-sports-analytics-data-lake"
glue_database_name = "glue_nba_data_lake"

def delete_s3_bucket():
    s3 = boto3.client("s3")
    try:
        print(f"Deleting bucket: {bucket_name}")
        objects = s3.list_objects_v2(Bucket=bucket_name)
        if "Contents" in objects:
            for obj in objects["Contents"]:
                s3.delete_object(Bucket=bucket_name, Key=obj["Key"])
        s3.delete_bucket(Bucket=bucket_name)
        print(f"Deleted bucket: {bucket_name}")
    except ClientError as e:
        print(f"Error deleting bucket: {e}")

def delete_glue_resources():
    glue = boto3.client("glue")
    try:
        print(f"Deleting Glue database: {glue_database_name}")
        tables = glue.get_tables(DatabaseName=glue_database_name)["TableList"]
        for table in tables:
            glue.delete_table(DatabaseName=glue_database_name, Name=table["Name"])
        glue.delete_database(Name=glue_database_name)
        print(f"Deleted Glue database: {glue_database_name}")
    except ClientError as e:
        print(f"Error deleting Glue resources: {e}")

def main():
    delete_s3_bucket()
    delete_glue_resources()
    print("Resources deleted successfully.")

if __name__ == "__main__":
    main()
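
Note that list_objects_v2 returns at most 1,000 objects per call, which is more than enough for the single data file this project creates; for a larger bucket you would paginate over the listing (for example with boto3's list_objects_v2 paginator) before deleting the objects.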

Step 6: Run the Scripts

  1. Run the Setup Script:

     python src/nba_data_lake_setup.py
    
  2. Run the Cleanup Script:

     python src/delete_aws_resources.py
    

Conclusion

In this guide, we demonstrated how to set up an NBA sports analytics data lake using AWS services such as S3, Glue, and Athena, along with data from SportsData.io. You learned how to:

  • Configure a Python environment for interacting with AWS.

  • Fetch and preprocess NBA player data.

  • Store raw data in an S3 bucket.

  • Create a Glue database and table to structure the data.

  • Configure Athena to query the data efficiently.

Additionally, a cleanup script was provided to remove the AWS resources when they are no longer needed, ensuring cost optimization and maintaining a tidy environment.

By following these steps, you now have a foundational understanding of building serverless data pipelines and creating a queryable data lake using AWS services. You can expand this setup by integrating more data sources, adding transformations using AWS Glue ETL, or visualizing insights in QuickSight.

Next Steps

  • Enhance Data Processing: Explore using AWS Glue ETL jobs to transform the data for deeper analytics.

  • Add Visualizations: Use AWS QuickSight to create dashboards from the queried data.

  • Automate the Pipeline: Incorporate AWS Lambda or Step Functions to automate the data ingestion and processing workflow.

  • Expand to Other Sports: Scale this framework to include data for other sports or analytics scenarios.
