WhatsApp is one of the most popular messaging apps in the world, with over 2 billion active users. It’s no surprise that many people want to analyse their WhatsApp data to gain insights into their conversations, myself included! In this post, I’ll detail how I used Python and BigQuery to analyse my WhatsApp data, covering everything from parsing the raw TXT export to visualising it. Let’s get started!
Parsing the TXT Export
Within WhatsApp it is possible to export chat data to a TXT file by opening the options within an individual or group chat, tapping “More” and then “Export chat”. To upload the contents of this file to BigQuery, I first had to parse it into a format that BigQuery supports for uploads. Here is how the chat data appears in the TXT file:
12/07/2022, 11:14 pm - Steve Jobs: Hello my name is Steve.
12/07/2022, 11:17 pm - Bill Gates: Hi Steve, it's nice to meet you.
Each line contains the date, the time, the name of the message sender and the content of the message. From these, the three key fields for BigQuery can be created: datetime, author and content. The TXT file has enough consistent formatting to parse each row into these fields, as described below. All of the initial parsing and cleaning of the exported file is done with Python.
Extracting each field from a line of data
Extracting the datetime from a line of chat log data was done in the following way:
import re
from datetime import datetime
line = "12/07/2022, 11:14 pm - Steve Jobs: Hello my name is Steve."
datetime_str = re.search(r'^(\d{2}\/\d{2}\/\d{4}),\s\d(?:\d)?:\d{2}.{1}(am|pm)', line).group()
# Note: some exports use a non-breaking space (\xa0) before am/pm, in which case the format string below needs adjusting.
datetime_obj = datetime.strptime(datetime_str, '%m/%d/%Y, %I:%M %p')
print(datetime_obj) # This will output '2022-12-07 23:14:00'
Here’s what’s happening in the code:
- Import the re module and the datetime class from the datetime module.
- Define a string variable called line that contains the text to extract the datetime from.
- Use the re.search() method to search for a pattern in the string that matches a datetime format. The pattern is r'^(\d{2}\/\d{2}\/\d{4}),\s\d(?:\d)?:\d{2}.{1}(am|pm)', which matches a date in “MM/DD/YYYY” format followed by a comma, a space, an hour in 12-hour format, a colon and minutes, a separator character, and finally either “am” or “pm”.
- Use the .group() method to extract the matched pattern from the string and assign it to a variable called datetime_str.
- Use the strptime() method to convert the extracted string into a datetime object. The first parameter is the string to convert, and the second parameter is a format string that specifies how to parse it. In this case '%m/%d/%Y, %I:%M %p' is used, which matches the format of the extracted string.
The above will only work if the date is in the “12/07/2022, 11:14 pm” format in the TXT file, which I’ve read not all exports use (perhaps an iOS/Android thing). If your export looks different, one option is a small helper that tries a few candidate formats; the sketch below is purely illustrative (the helper name and format list are my own examples, not anything WhatsApp defines):
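from datetime import datetime

# Hypothetical helper: try a few plausible export formats until one parses.
# Which formats you actually need depends on the device's language/region settings.
CANDIDATE_FORMATS = [
    '%m/%d/%Y, %I:%M %p',   # 12/07/2022, 11:14 pm
    '%d/%m/%Y, %I:%M %p',   # 07/12/2022, 11:14 pm
    '%d/%m/%y, %H:%M',      # 12/07/22, 23:14
]

def parse_export_datetime(datetime_str):
    """Return a datetime for the first candidate format that matches, else None."""
    # Some exports use a non-breaking space before am/pm; normalise it first.
    datetime_str = datetime_str.replace('\xa0', ' ').replace('\u202f', ' ')
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(datetime_str, fmt)
        except ValueError:
            continue
    return None
Using similar regex, the author and content can be extracted: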
import re
line = "12/07/2022, 11:14 pm - Steve Jobs: Hello my name is Steve."
author = re.search(r'^\d{2}\/\d{2}\/\d{4},\s\d\d?:\d\d.(?:am|pm).-.(.*?):', line).group(1)
print(author) # This will output 'Steve Jobs'
start = re.search(author, line).span()[1]
content = re.search(r'(?<=:\s).*$', line[start:]).group().strip()
print(content) # This will output 'Hello my name is Steve.'
The author value is extracted by simply looking for the text that comes after the datetime value and before the colon, using a method similar to finding the datetime described above.
To extract the content, the author’s name is first located in the line string using the re.search() method. The .span() method returns a tuple containing the start and end positions of the match, and the end position, i.e. the index of the character immediately following the author’s name, is obtained with [1] and assigned to a variable called start. Next, the remainder of the line is searched for a pattern matching any characters that appear after a colon and space. The positive lookbehind assertion (?<=:\s) ensures the matched text is preceded by a colon and space, and the pattern .*$ matches any number of characters until the end of the string. The matched text is extracted with .group(), assigned to a variable called content, and finally any leading or trailing whitespace is removed with .strip().
Handling Newline Characters
Newline characters are a common nuisance for data file parsing, and the WhatsApp TXT file is no exception. Chat messages containing multiple lines will appear on multiple lines in the exported file like so:
12/07/2022, 11:14 pm - Steve Jobs: This is first message on single line.
12/07/2022, 11:32 pm - Bill Gates: Start of second message
Rest of second message.
12/07/2022, 11:14 pm - Steve Jobs: final message.
To capture the entire contents of a single message, the parser needs to know that the “Rest of second message.” line belongs to the message sent on the line above. The datetime regex is reused for this, as any line starting with a datetime is the start of a new message:
def _is_start_of_new_message(line):
"""All lines starting with a datetime are considered the beginning of a new message."""
if re.match(r'^(\d{2}\/\d{2}\/\d{4}),\s\d(?:\d)?:\d{2}.{1}(am|pm)', line):
return True
return False
This function returns True if the line being parsed is the start of a new message and False otherwise, which is needed when parsing the entire file.
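Running it against the sample lines above gives the expected results:
print(_is_start_of_new_message('12/07/2022, 11:32 pm - Bill Gates: Start of second message'))  # True
print(_is_start_of_new_message('Rest of second message.'))  # False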
Parsing the Entire File
Now that individual fields can be extracted, these processes can be consolidated within a class. This class will iterate through each line of the file, extract the values, and store them to be outputted to a pandas DataFrame:
import re
import os
import pandas as pd
from datetime import datetime, timedelta
class WhatsAppParser:
def __init__(self, file_path):
self._header = []
self._messages = self._get_messages_from_file(file_path)
def _get_absolute_file_path(self, file_path):
"""Return the absolute path for the target .txt file."""
return os.path.abspath(file_path)
def _get_messages_from_file(self, file_path):
"""Return a DataFrame of cleaned WhatsApp messages.
Args:
file_path (str): Path to file.
Returns:
DataFrame: DataFrame of WhatsApp messages with fields: datetime, author, content.
"""
messages = []
path_to_file = self._get_absolute_file_path(file_path)
# Can init previous date time with now, it doesn't matter too much.
previous_datetime = datetime.now()
with open(path_to_file, encoding="utf8") as file:
for row_number, line in enumerate(file, 1):
if self._is_start_of_new_message(line):
message = self._construct_message(line, row_number)
if message is not None:
# Increment second value to ensure order for messages sent during same minute.
if message['datetime'].minute == previous_datetime.minute:
message['datetime'] = message['datetime'] + timedelta(seconds=1+previous_datetime.second)
previous_datetime = message['datetime']
messages.append(message)
elif any(messages):
messages[-1]['content'] += f' {line.strip()}'
else:
self._header.append(line.strip())
return messages
def _construct_message(self, line, row_number):
"""Fetch data from each line inside the file and returns a dict."""
message = None
try:
datetime = self._get_datetime_from_line(line)
author = self._get_author_from_line(line)
content = self._get_content_from_line(line, author)
message = {'datetime': datetime, 'author': author, 'content': content}
        except Exception:
print(f'Failed to extract data for line number [{row_number}]. Skipping...')
return message
def to_dataframe(self):
"""Convert the WhatsAppParser object into a pandas dataframe."""
return pd.DataFrame(self._messages)
@staticmethod
def _is_start_of_new_message(line):
"""All lines starting with a datetime are considered the beginning of a new message."""
if re.match(r'^(\d{2}\/\d{2}\/\d{4}),\s\d(?:\d)?:\d{2}.{1}(am|pm)', line):
return True
return False
@staticmethod
def _get_datetime_from_line(line):
"""Extract datetime data from a line."""
        datetime_str = re.search(r'^(\d{2}\/\d{2}\/\d{4}),\s\d(?:\d)?:\d{2}.{1}(am|pm)', line).group()
        # Some exports use a non-breaking space before am/pm; normalise it before parsing.
        datetime_str = datetime_str.replace('\xa0', ' ').replace('\u202f', ' ')
        # 12/07/2022, 11:16 pm
        datetime_format = '%d/%m/%Y, %I:%M %p'
        datetime_obj = datetime.strptime(datetime_str, datetime_format)
        return datetime_obj
@staticmethod
def _get_author_from_line(line):
"""Extract author data from a line."""
author = re.search(r'^\d{2}\/\d{2}\/\d{4},\s\d\d?:\d\d.(?:am|pm).-.(.*?):', line).group(1)
return author
@staticmethod
def _get_content_from_line(line, author):
"""Extract content data from a line."""
start = re.search(author, line).span()[1]
content = re.search(r'(?<=:\s).*$', line[start:]).group().strip()
return content
The below shows how this class would be used to create a DataFrame of WhatsApp data:
whatsapp_txt_file = 'path-to-whatsapp-file/whatsapp-export.txt'
messages_df = WhatsAppParser(whatsapp_txt_file).to_dataframe()
There are two aspects of the full class worth highlighting:
- When the line being parsed is not the start of a new message, i.e. after self._is_start_of_new_message(line) has returned False. When this happens, the entire content of the line is simply appended to the previous record with messages[-1]['content'] += f' {line.strip()}', before moving on to the next line in the file.
- When multiple chat messages are sent within the same minute. The TXT export does not include a seconds value, which means that any messages sent within the same minute appear to have been sent at exactly the same time. This lack of distinction can cause issues when trying to order the messages accurately. As the lines in the exported file are listed chronologically, the seconds value can be incremented with message['datetime'] = message['datetime'] + timedelta(seconds=1+previous_datetime.second), as illustrated in the sketch after this list. The seconds value resets to zero once a new message is sent in a different minute window. Milliseconds could have been used to allow for more than 60 messages in a single minute, but that doesn’t tend to happen in my chats.
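Below is a minimal sketch of that second point in isolation, using three hypothetical messages that were all exported with the same 11:14 pm timestamp:
from datetime import datetime, timedelta

# Three messages parsed from the export with identical minute-level timestamps.
parsed = [datetime(2022, 7, 12, 23, 14) for _ in range(3)]

previous_datetime = datetime(1970, 1, 1)  # stand-in for the initial value
ordered = []
for message_datetime in parsed:
    # Same logic as the class: bump the seconds when the minute matches the previous message.
    if message_datetime.minute == previous_datetime.minute:
        message_datetime += timedelta(seconds=1 + previous_datetime.second)
    previous_datetime = message_datetime
    ordered.append(message_datetime)

print(ordered)  # 23:14:00, 23:14:01, 23:14:02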
Uploading to BigQuery
In the previous section, I created a DataFrame object containing the chat data, which may already be enough for some to start analysing it, or to export it into a compatible format like CSV for manual upload into BigQuery.
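If the manual route suits you, writing the DataFrame out to a CSV file is a one-liner (the file name here is just an example):
# Write the parsed messages to CSV for a manual upload via the BigQuery console.
messages_df.to_csv('whatsapp-export.csv', index=False)
If, like me, you scoff at the idea of doing anything manually when it can be automated, then read on while I upload this DataFrame into BigQuery using Python.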
Using Google APIs
Shown below is a function that utilises the bigquery module from the google.cloud package. It uploads the messages_df DataFrame into a table with the name specified by the table_name parameter. Note the placeholders for the GCP Project ID and BigQuery Dataset ID, which should be replaced accordingly:
from google.cloud import bigquery
def upload_to_bigquery(messages_df, table_name):
"""Upload the DataFrame to BigQuery.
Args:
messages_df (pandas.DataFrame): DataFrame of WhatsApp messages.
table_name (str): Name of BigQuery table for the WhatsApp messages.
"""
# Construct a BigQuery client object
project_id = '[YOUR-GCP-PROJECT-ID]'
client = bigquery.Client(project=project_id)
# Create table name
dataset_name = '[YOUR-DATASET-ID]'
table_id = f'{project_id}.{dataset_name}.{table_name}'
# Create the load job config
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("datetime", bigquery.enums.SqlTypeNames.DATETIME),
bigquery.SchemaField("author", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("content", bigquery.enums.SqlTypeNames.STRING),
],
write_disposition="WRITE_APPEND",
create_disposition="CREATE_IF_NEEDED"
)
# Get number of rows currently added to the table, if any.
try:
table = client.get_table(table_id)
current_row_count = table.num_rows
    except Exception:
print(f'Ran into error when fetching table size for [{table_id}]. Assuming it does not exist yet.')
current_row_count = 0
# Make an API request
job = client.load_table_from_dataframe(messages_df, table_id, job_config=job_config)
job.result()
# Get number of rows added to the table
table = client.get_table(table_id)
after_upload_row_count = table.num_rows
print(f"Added [{after_upload_row_count-current_row_count}] rows to table [{table_id}]")
This function takes two arguments: a pandas DataFrame called messages_df that contains the WhatsApp messages, and a string called table_name that specifies the name of the BigQuery table to upload the messages to. A BigQuery client object is created, along with a string called table_id that holds the full ID of the destination table. A load job configuration object is then created that specifies the schema of the data being uploaded and how to handle conflicts with existing data in the table.
The next few lines attempt to get the number of rows currently in the BigQuery table specified by table_id. If an error occurs (e.g. if the table doesn’t exist yet), an error message is printed and it is assumed that there are no rows in the table. Lastly, a load job is created that uploads the data in messages_df to the BigQuery table specified by table_id, followed by fetching the number of rows in the table after the upload to print out how many rows were added.
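Putting the pieces together, a run over a single export could look like this (the table name is just an example):
# Parse the export and push the resulting DataFrame straight into BigQuery.
messages_df = WhatsAppParser('path-to-whatsapp-file/whatsapp-export.txt').to_dataframe()
upload_to_bigquery(messages_df, 'steve_jobs_all_messages')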
Enriching and Deduplicating the data
If this logic were run as is, then each time a new WhatsApp file was imported its data would simply be appended to the table, resulting in a lot of duplicated rows. The table could instead be overwritten each time, but these exports are limited to the most recent 40,000 messages, so overwriting messages that potentially cannot be recovered is best avoided.
The solution I came up with was to simply select the distinct rows from the table after each upload, which is done at the end of the upload_to_bigquery function above:
# Create query that will remove any duplicate rows from the table and add a row sequence.
query = f'''
SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY DATE(datetime) ORDER BY datetime ASC) AS day_sequence_num
FROM (
    SELECT DISTINCT
        datetime,
        author,
        content
    FROM
        `{table_id}`)'''
job_config = bigquery.QueryJobConfig(destination=table_id, write_disposition="WRITE_TRUNCATE")
# Make an API request
query_job = client.query(query, job_config=job_config)
query_job.result()
# Get number of rows removed during deduplication process
table = client.get_table(table_id)
after_distinct_row_count = table.num_rows
print(f"Removed [{after_upload_row_count-after_distinct_row_count}] duplicate rows leaving [{after_distinct_row_count}] unique rows in table.")
I’m sure there is a more optimised and scalable solution for this type of problem, but for my purposes it’ll do.
There is also an additional field added by that query to make some later analysis easier: day_sequence_num. This is the sequence number of the message within the day it was sent, allowing me to quickly find, for example, the first message sent each day. Again, I could likely optimise this, but it was the quickest solution when working with my visualisation tool.
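As an example of how that field gets used later, here’s a quick sketch that pulls the first message of each day using the BigQuery client (placeholders as before):
from google.cloud import bigquery

client = bigquery.Client(project='[YOUR-GCP-PROJECT-ID]')
query = '''
    SELECT DATE(datetime) AS day, author, content
    FROM `[YOUR-GCP-PROJECT-ID].[YOUR-DATASET-ID].[YOUR-TABLE-ID]`
    WHERE day_sequence_num = 1
    ORDER BY day'''

# Print the first message sent on each day.
for row in client.query(query).result():
    print(row.day, row.author, row.content)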
Parsing the filename
Lastly, there was one final bit of automation I wrote to help with loading multiple chat logs into their own tables, which was to read the name of the raw export file. The name of the exported file is always something like this:
WhatsApp Chat with [NAME].txt
Below is some logic which reads that [NAME] value and converts it into the format name_all_messages:
def process_chat_file_name(file_name:str) -> str:
"""Parse the file name, returning a suitable table name if successful.
Args:
file_name (str): The raw file name of the export.
Returns:
str: Formatted table name.
"""
table_name = ''
# Check if the file name matches the required format
pattern = r'^WhatsApp Chat with ([a-zA-Z]+) ([a-zA-Z]+)\.txt$'
match = re.match(pattern, file_name)
if not match:
print(f'Invalid file name format: {file_name}')
return table_name
# Extract the first and last name from the file name
first_name = match.group(1)
last_name = match.group(2)
# Generate the output string
table_name = f'{first_name.lower()}_{last_name.lower()}_all_messages'
return table_name
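A couple of quick examples of how this behaves:
print(process_chat_file_name('WhatsApp Chat with Steve Jobs.txt'))  # steve_jobs_all_messages
print(process_chat_file_name('random-file.txt'))                    # '' (prints an invalid format warning)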
Create a Cloud Function
The automation doesn’t stop there! I wanted to simply be able to add a newly exported file to a Google Cloud Storage bucket and have it run all of the Python logic described so far. For this, I created a Cloud Function which triggers when new files are added to a bucket.
This would be the function that is first called for this event:
import os
import re
import tempfile
from google.cloud import bigquery, storage
from whatsapp_parser import WhatsAppParser
def process_whatsapp_file(data, context):
"""Triggered by a change to a Cloud Storage bucket.
Args:
data (dict): The Cloud Functions event payload.
context (google.cloud.functions.Context): Metadata of triggering event.
"""
# Extract bucket and file details from event message
bucket_name = data['bucket']
file_name = data['name']
table_name = process_chat_file_name(file_name)
if table_name == '':
return
# Create a Cloud Storage client object
storage_client = storage.Client()
# Download file to local temp file
_, temp_local_filename = tempfile.mkstemp()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(file_name)
blob.download_to_filename(temp_local_filename)
# Parse file with WhatsAppParser
messages_df = WhatsAppParser(temp_local_filename).to_dataframe()
# Upload to BigQuery
upload_to_bigquery(messages_df, table_name)
# Delete local temp file
os.remove(temp_local_filename)
The above script first imports the necessary modules: os, re, tempfile, google.cloud.bigquery, google.cloud.storage, and a custom module called whatsapp_parser, which contains the WhatsAppParser class detailed in the first section. The function process_whatsapp_file is what gets executed when the Cloud Storage bucket changes. It takes two arguments: data (a dictionary representing the event payload) and context (metadata of the triggering event).
The function extracts relevant details from the event message, such as the bucket name, file name, and infers a table name based on the chat file name. It then creates a client object for Google Cloud Storage, which is used to download the file from the Cloud Storage bucket to a temporary local file using a generated unique filename.
The WhatsAppParser class is then used to parse the downloaded file and convert it into a DataFrame object named messages_df, which is uploaded to BigQuery using the upload_to_bigquery function detailed earlier. After the DataFrame has been successfully uploaded, the script deletes the local temporary file to clean up disk space. Cloud Functions require a requirements.txt file to know which additional packages to install, which can be found along with the complete Cloud Function in my GitHub.
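For reference, the requirements.txt needs at least the packages used above; something along these lines should work (pyarrow is required by the BigQuery client for load_table_from_dataframe):
google-cloud-bigquery
google-cloud-storage
pandas
pyarrow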
Some Basic Metrics
So I finally have the WhatsApp data in BigQuery after parsing the raw exported file and uploading it with Python. I can now query this data and answer some basic questions one might have about their chat logs (note the placeholder for the BigQuery Table ID):
SELECT
author,
COUNT(content) AS total_messages,
SUM(ARRAY_LENGTH(SPLIT(content, ' '))) AS total_words,
ROUND(AVG(ARRAY_LENGTH(SPLIT(content, ' '))),4) AS average_words_per_message,
MAX(ARRAY_LENGTH(SPLIT(content, ' '))) AS most_words_in_single_message,
MAX(LENGTH(content)) AS longest_single_message
FROM
`[YOUR-GCP-PROJECT-ID].[YOUR-DATASET-ID].[YOUR-TABLE-ID]`
GROUP BY
1
I’ll assume you already have an understanding of SQL so will just explain the columns returned by this query:
- author: It selects the author column from the table, which is then used in the GROUP BY clause so that the calculations in the SELECT clause are performed separately for each unique author in the table.
- COUNT(content) AS total_messages: It calculates the total number of messages for each author.
- SUM(ARRAY_LENGTH(SPLIT(content, ' '))) AS total_words: It calculates the total number of words across all messages for each author. It splits the content column by spaces, counts the number of resulting elements (words), and sums them up.
- ROUND(AVG(ARRAY_LENGTH(SPLIT(content, ' '))),4) AS average_words_per_message: It calculates the average number of words per message for each author. It follows the same logic as above but uses the AVG function to calculate the average and rounds it to four decimal places.
- MAX(ARRAY_LENGTH(SPLIT(content, ' '))) AS most_words_in_single_message: It calculates the maximum number of words in a single message for each author. It again splits the content column by spaces, counts the words in each message, and selects the maximum value.
- MAX(LENGTH(content)) AS longest_single_message: It calculates the length of the longest single message (in terms of characters) for each author.
How about emoji usage? Here’s a query to find the frequency of each emoji used (note the placeholders for the author values in the table and the BigQuery Table ID):
SELECT
emoji,
COUNT(*) AS frequency,
COUNT(CASE WHEN author = "[AUTHOR ONE]" THEN emoji END) AS author_one_frequency,
COUNT(CASE WHEN author = "[AUTHOR TWO]" THEN emoji END) AS author_two_frequency
FROM (
SELECT
author,
REGEXP_EXTRACT_ALL(content, r'[\x{1F600}-\x{1F64F}]|[\x{2700}-\x{27BF}]|[\x{1F900}-\x{1F9FF}]|[\x{1F680}-\x{1F6FF}]|[\x{1F1E0}-\x{1F1FF}]|[:][\)\(PpoO\\\/DdSs]|<3') AS emojis
FROM
`[YOUR-GCP-PROJECT-ID].[YOUR-DATASET-ID].[YOUR-TABLE-ID]`)
CROSS JOIN
UNNEST(emojis) AS emoji
GROUP BY
emoji
ORDER BY
2 DESC
As this query has a few different parts to it, here’s a more detailed explanation on what is happening:
- The query starts with the SELECT clause, which specifies the columns to be included in the result:
  - emoji: It selects the emoji column.
  - COUNT(*) AS frequency: It calculates the frequency of each emoji occurrence.
  - COUNT(CASE WHEN author = "[AUTHOR ONE]" THEN emoji END) AS author_one_frequency: It calculates the frequency of each emoji occurrence specifically for [AUTHOR ONE].
  - COUNT(CASE WHEN author = "[AUTHOR TWO]" THEN emoji END) AS author_two_frequency: It calculates the frequency of each emoji occurrence specifically for [AUTHOR TWO].
- The FROM clause contains a subquery. The subquery selects the author column and extracts all occurrences of emojis from the content column using a regular expression pattern. The extracted emojis are stored in an array called emojis.
- The outer query performs a CROSS JOIN with the UNNEST function on the emojis array, which essentially flattens the array into individual rows. This creates a row for each occurrence of an emoji.
- The GROUP BY clause groups the result set by the emoji column, so all occurrences of the same emoji are grouped together.
- The ORDER BY clause sorts the result set in descending order based on the second column (frequency) to show the emojis with the highest frequency first.
In summary, the query retrieves information about the frequency of emojis in a specified table. It calculates the overall frequency and frequency for specific authors, sorts the emojis based on frequency in descending order, and presents the result.