WhatsApp is one of the most popular messaging apps in the world, with over 2 billion active users. It’s no surprise that many people want to analyse their WhatsApp data to gain insights into their conversations, myself included! In this post, I’ll detail how I used Python and BigQuery to analyse my WhatsApp data, covering everything from parsing the raw TXT export to visualising it. Let’s get started!
Parsing the TXT Export
Within WhatsApp it is possible to export chat data to a TXT file by opening the options within an individual or group chat, tapping “More” and then “Export chat”. To upload the contents of this file to BigQuery, I first had to parse it into a format that BigQuery supports for uploads. Here is how the chat data appears in the TXT file:
12/07/2022, 11:14 pm - Steve Jobs: Hello my name is Steve.
12/07/2022, 11:17 pm - Bill Gates: Hi Steve, it's nice to meet you.
Each line contains the date, the time, the name of the message sender and the content of the message. From these, the three key fields for BigQuery can be created: datetime, author and content. The TXT file has enough consistent formatting to parse each row into these fields, as described below. All of the initial parsing and cleaning of the exported file is done with Python.
Extracting each field from a line of data
Extracting the datetime from a line of chat log data was done in the following way:
import re
from datetime import datetime
line = "12/07/2022, 11:14 pm - Steve Jobs: Hello my name is Steve."
datetime_str = re.search(r'^(\d{2}\/\d{2}\/\d{4}),\s\d(?:\d)?:\d{2}.{1}(am|pm)', line).group()
# Note: some exports use a non-breaking space (\xa0) before am/pm, in which case the format string below needs adjusting.
datetime_obj = datetime.strptime(datetime_str, '%m/%d/%Y, %I:%M %p')
print(datetime_obj) # This will output '2022-12-07 23:14:00'
Here’s what’s happening in the code:
- Import the re module and the datetime class from the datetime module.
- Define a string variable called line that contains the text to extract the datetime from.
- Use the re.search() method to search for a pattern in the string that matches a datetime format. The pattern is r'^(\d{2}\/\d{2}\/\d{4}),\s\d(?:\d)?:\d{2}.{1}(am|pm)', which matches a date in “MM/DD/YYYY” format followed by a comma, a space, an hour in 12-hour format, a colon and minutes, a separator character, and finally either “am” or “pm”.
- Use the .group() method to extract the matched pattern from the string and assign it to a variable called datetime_str.
- Use the strptime() method to convert the extracted string into a datetime object. The first parameter is the string to convert, and the second parameter is a format string that specifies how to parse it. In this case '%m/%d/%Y, %I:%M %p' is used, which matches the format of the extracted string.
The above will only work if the date is in the “12/07/2022, 11:14 pm” format in the TXT file, which I’ve read not all exports use (perhaps an iOS/Android thing). If your export looks different, one option is a small helper that tries a few candidate formats; the sketch below is purely illustrative (the helper name and format list are my own examples, not anything WhatsApp defines):
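from datetime import datetime

# Hypothetical helper: try a few plausible export formats until one parses.
# Which formats you actually need depends on the device's language/region settings.
CANDIDATE_FORMATS = [
    '%m/%d/%Y, %I:%M %p',   # 12/07/2022, 11:14 pm
    '%d/%m/%Y, %I:%M %p',   # 07/12/2022, 11:14 pm
    '%d/%m/%y, %H:%M',      # 12/07/22, 23:14
]

def parse_export_datetime(datetime_str):
    """Return a datetime for the first candidate format that matches, else None."""
    # Some exports use a non-breaking space before am/pm; normalise it first.
    datetime_str = datetime_str.replace('\xa0', ' ').replace('\u202f', ' ')
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(datetime_str, fmt)
        except ValueError:
            continue
    return None
Using similar regex, the author and content can be extracted: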
import re
line = "12/07/2022, 11:14 pm - Steve Jobs: Hello my name is Steve."
author = re.search(r'^\d{2}\/\d{2}\/\d{4},\s\d\d?:\d\d.(?:am|pm).-.(.*?):', line).group(1)
print(author) # This will output 'Steve Jobs'
start = re.search(author, line).span()[1]
content = re.search(r'(?<=:\s).*$', line[start:]).group().strip()
print(content) # This will output 'Hello my name is Steve.'
The author value is extracted by simply looking for the text that comes after the datetime value and before the colon, using a method similar to finding the datetime described above.
To extract the content, the author’s name is first located in the line string using the re.search() method. The .span() method returns a tuple containing the start and end positions of the match, and the end position, i.e. the index of the character immediately following the author’s name, is obtained with [1] and assigned to a variable called start. Next, the remainder of the line is searched for a pattern matching any characters that appear after a colon and space. The positive lookbehind assertion (?<=:\s) ensures the matched text is preceded by a colon and space, and the pattern .*$ matches any number of characters until the end of the string. The matched text is extracted with .group(), assigned to a variable called content, and finally any leading or trailing whitespace is removed with .strip().
Handling Newline Characters
Newline characters are a common nuisance for data file parsing, and the WhatsApp TXT file is no exception. Chat messages containing multiple lines will appear on multiple lines in the exported file like so:
12/07/2022, 11:14 pm - Steve Jobs: This is first message on single line.
12/07/2022, 11:32 pm - Bill Gates: Start of second message
Rest of second message.
12/07/2022, 11:14 pm - Steve Jobs: final message.
To capture the entire contents of a single message, the parser needs to know that the “Rest of second message.” line belongs to the message sent on the line above. The datetime regex is reused for this, as any line starting with a datetime is the start of a new message:
def _is_start_of_new_message(line):
"""All lines starting with a datetime are considered the beginning of a new message."""
if re.match(r'^(\d{2}\/\d{2}\/\d{4}),\s\d(?:\d)?:\d{2}.{1}(am|pm)', line):
return True
return False
This function returns True if the line being parsed is the start of a new message and False otherwise, which is needed when parsing the entire file.
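Running it against the sample lines above gives the expected results:
print(_is_start_of_new_message('12/07/2022, 11:32 pm - Bill Gates: Start of second message'))  # True
print(_is_start_of_new_message('Rest of second message.'))  # False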
Parsing the Entire File
Now that individual fields can be extracted, these processes can be consolidated within a class. This class will iterate through each line of the file, extract the values, and store them to be outputted to a pandas DataFrame:
import re
import os
import pandas as pd
from datetime import datetime, timedelta
class WhatsAppParser:
def __init__(self, file_path):
self._header = []
self._messages = self._get_messages_from_file(file_path)
def _get_absolute_file_path(self, file_path):
"""Return the absolute path for the target .txt file."""
return os.path.abspath(file_path)
def _get_messages_from_file(self, file_path):
"""Return a DataFrame of cleaned WhatsApp messages.
Args:
file_path (str): Path to file.
Returns:
DataFrame: DataFrame of WhatsApp messages with fields: datetime, author, content.
"""
messages = []
path_to_file = self._get_absolute_file_path(file_path)
# Can init previous date time with now, it doesn't matter too much.
previous_datetime = datetime.now()
with open(path_to_file, encoding="utf8") as file:
for row_number, line in enumerate(file, 1):
if self._is_start_of_new_message(line):
message = self._construct_message(line, row_number)
if message is not None:
# Increment second value to ensure order for messages sent during same minute.
if message['datetime'].minute == previous_datetime.minute:
message['datetime'] = message['datetime'] + timedelta(seconds=1+previous_datetime.second)
previous_datetime = message['datetime']
messages.append(message)
elif any(messages):
messages[-1]['content'] += f' {line.strip()}'
else:
self._header.append(line.strip())
return messages
def _construct_message(self, line, row_number):
"""Fetch data from each line inside the file and returns a dict."""
message = None
try:
datetime = self._get_datetime_from_line(line)
author = self._get_author_from_line(line)
content = self._get_content_from_line(line, author)
message = {'datetime': datetime, 'author': author, 'content': content}
        except Exception:
print(f'Failed to extract data for line number [{row_number}]. Skipping...')
return message
def to_dataframe(self):
"""Convert the WhatsAppParser object into a pandas dataframe."""
return pd.DataFrame(self._messages)
@staticmethod
def _is_start_of_new_message(line):
"""All lines starting with a datetime are considered the beginning of a new message."""
if re.match(r'^(\d{2}\/\d{2}\/\d{4}),\s\d(?:\d)?:\d{2}.{1}(am|pm)', line):
return True
return False
@staticmethod
def _get_datetime_from_line(line):
"""Extract datetime data from a line."""
        datetime_str = re.search(r'^(\d{2}\/\d{2}\/\d{4}),\s\d(?:\d)?:\d{2}.{1}(am|pm)', line).group()
        # Some exports use a non-breaking space before am/pm; normalise it before parsing.
        datetime_str = datetime_str.replace('\xa0', ' ').replace('\u202f', ' ')
        # 12/07/2022, 11:16 pm
        datetime_format = '%d/%m/%Y, %I:%M %p'
        datetime_obj = datetime.strptime(datetime_str, datetime_format)
        return datetime_obj
@staticmethod
def _get_author_from_line(line):
"""Extract author data from a line."""
author = re.search(r'^\d{2}\/\d{2}\/\d{4},\s\d\d?:\d\d.(?:am|pm).-.(.*?):', line).group(1)
return author
@staticmethod
def _get_content_from_line(line, author):
"""Extract content data from a line."""
start = re.search(author, line).span()[1]
content = re.search(r'(?<=:\s).*$', line[start:]).group().strip()
return content
The below shows how this class would be used to create a DataFrame of WhatsApp data:
whatsapp_txt_file = 'path-to-whatsapp-file/whatsapp-export.txt'
messages_df = WhatsAppParser(whatsapp_txt_file).to_dataframe()
There are two aspects of the full class worth highlighting:
- When the line being parsed is not the start of a new message, i.e. after self._is_start_of_new_message(line) has returned False. When this happens, the entire content of the line is simply appended to the previous record with messages[-1]['content'] += f' {line.strip()}', before moving on to the next line in the file.
- When multiple chat messages are sent within the same minute. The TXT export does not include a seconds value, which means that any messages sent within the same minute appear to have been sent at exactly the same time. This lack of distinction can cause issues when trying to order the messages accurately. As the lines in the exported file are listed chronologically, the seconds value can be incremented with message['datetime'] = message['datetime'] + timedelta(seconds=1+previous_datetime.second), as illustrated in the sketch after this list. The seconds value resets to zero once a new message is sent in a different minute window. Milliseconds could have been used to allow for more than 60 messages in a single minute, but that doesn’t tend to happen in my chats.
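Below is a minimal sketch of that second point in isolation, using three hypothetical messages that were all exported with the same 11:14 pm timestamp:
from datetime import datetime, timedelta

# Three messages parsed from the export with identical minute-level timestamps.
parsed = [datetime(2022, 7, 12, 23, 14) for _ in range(3)]

previous_datetime = datetime(1970, 1, 1)  # stand-in for the initial value
ordered = []
for message_datetime in parsed:
    # Same logic as the class: bump the seconds when the minute matches the previous message.
    if message_datetime.minute == previous_datetime.minute:
        message_datetime += timedelta(seconds=1 + previous_datetime.second)
    previous_datetime = message_datetime
    ordered.append(message_datetime)

print(ordered)  # 23:14:00, 23:14:01, 23:14:02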
Uploading to BigQuery
In the previous section, I created a DataFrame object containing the chat data, which may already be enough for some to start analysing it, or to export it into a compatible format like CSV for manual upload into BigQuery.
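If the manual route suits you, writing the DataFrame out to a CSV file is a one-liner (the file name here is just an example):
# Write the parsed messages to CSV for a manual upload via the BigQuery console.
messages_df.to_csv('whatsapp-export.csv', index=False)
If, like me, you scoff at the idea of doing anything manually when it can be automated, then read on while I upload this DataFrame into BigQuery using Python.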
Using Google APIs
Shown below is a function that utilises the bigquery module from the google.cloud package. It uploads the messages_df DataFrame into a table with the name specified by the table_name parameter. Note the placeholders for the GCP Project ID and BigQuery Dataset ID, which should be replaced accordingly:
from google.cloud import bigquery
def upload_to_bigquery(messages_df, table_name):
"""Upload the DataFrame to BigQuery.
Args:
messages_df (pandas.DataFrame): DataFrame of WhatsApp messages.
table_name (str): Name of BigQuery table for the WhatsApp messages.
"""
# Construct a BigQuery client object
project_id = '[YOUR-GCP-PROJECT-ID]'
client = bigquery.Client(project=project_id)
# Create table name
dataset_name = '[YOUR-DATASET-ID]'
table_id = f'{project_id}.{dataset_name}.{table_name}'
# Create the load job config
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("datetime", bigquery.enums.SqlTypeNames.DATETIME),
bigquery.SchemaField("author", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("content", bigquery.enums.SqlTypeNames.STRING),
],
write_disposition="WRITE_APPEND",
create_disposition="CREATE_IF_NEEDED"
)
# Get number of rows currently added to the table, if any.
try:
table = client.get_table(table_id)
current_row_count = table.num_rows
    except Exception:
print(f'Ran into error when fetching table size for [{table_id}]. Assuming it does not exist yet.')
current_row_count = 0
# Make an API request
job = client.load_table_from_dataframe(messages_df, table_id, job_config=job_config)
job.result()
# Get number of rows added to the table
table = client.get_table(table_id)
after_upload_row_count = table.num_rows
print(f"Added [{after_upload_row_count-current_row_count}] rows to table [{table_id}]")
This function takes two arguments: a pandas DataFrame called messages_df that contains the WhatsApp messages, and a string called table_name that specifies the name of the BigQuery table to upload the messages to. A BigQuery client object is created, along with a string called table_id that holds the full ID of the destination table. A load job configuration object is then created that specifies the schema of the data being uploaded and how to handle conflicts with existing data in the table.
The next few lines attempt to get the number of rows currently in the BigQuery table specified by table_id. If an error occurs (e.g. if the table doesn’t exist yet), an error message is printed and it is assumed that there are no rows in the table. Lastly, a load job is created that uploads the data in messages_df to the BigQuery table specified by table_id, followed by fetching the number of rows in the table after the upload to print out how many rows were added.
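Putting the pieces together, a run over a single export could look like this (the table name is just an example):
# Parse the export and push the resulting DataFrame straight into BigQuery.
messages_df = WhatsAppParser('path-to-whatsapp-file/whatsapp-export.txt').to_dataframe()
upload_to_bigquery(messages_df, 'steve_jobs_all_messages')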
Enriching and Deduplicating the data
If this logic were run as is, then each time a new WhatsApp file was imported its data would simply be appended to the table, resulting in a lot of duplicated rows. The table could instead be overwritten each time, but these exports are limited to the most recent 40,000 messages, so overwriting messages that potentially cannot be recovered is best avoided.
The solution I came up with was to simply select the distinct rows from the table after each upload, which is done at the end of the upload_to_bigquery function above:
# Create query that will remove any duplicate rows from the table and add a row sequence.
query = f'''
SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY DATE(datetime) ORDER BY datetime ASC) AS day_sequence_num
FROM (
    SELECT DISTINCT
        datetime,
        author,
        content
    FROM
        `{table_id}`)'''
job_config = bigquery.QueryJobConfig(destination=table_id, write_disposition="WRITE_TRUNCATE")
# Make an API request
query_job = client.query(query, job_config=job_config)
query_job.result()
# Get number of rows removed during deduplication process
table = client.get_table(table_id)
after_distinct_row_count = table.num_rows
print(f"Removed [{after_upload_row_count-after_distinct_row_count}] duplicate rows leaving [{after_distinct_row_count}] unique rows in table.")
I’m sure there is a more optimised and scalable solution for this type of problem, but for my purposes it’ll do.
There is also an additional field added by that query to make some later analysis easier: day_sequence_num. This is the sequence number of the message within the day it was sent, allowing me to quickly find, for example, the first message sent each day. Again, I could likely optimise this, but it was the quickest solution when working with my visualisation tool.
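As an example of how that field gets used later, here’s a quick sketch that pulls the first message of each day using the BigQuery client (placeholders as before):
from google.cloud import bigquery

client = bigquery.Client(project='[YOUR-GCP-PROJECT-ID]')
query = '''
    SELECT DATE(datetime) AS day, author, content
    FROM `[YOUR-GCP-PROJECT-ID].[YOUR-DATASET-ID].[YOUR-TABLE-ID]`
    WHERE day_sequence_num = 1
    ORDER BY day'''

# Print the first message sent on each day.
for row in client.query(query).result():
    print(row.day, row.author, row.content)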
Parsing the filename
Lastly, there was one final bit of automation I wrote to help with loading multiple chat logs into their own tables, which was to read the name of the raw export file. The name of the exported file is always something like this:
WhatsApp Chat with [NAME].txt
Below is some logic which reads that [NAME] value and converts it into the format name_all_messages:
def process_chat_file_name(file_name:str) -> str:
"""Parse the file name, returning a suitable table name if successful.
Args:
file_name (str): The raw file name of the export.
Returns:
str: Formatted table name.
"""
table_name = ''
# Check if the file name matches the required format
pattern = r'^WhatsApp Chat with ([a-zA-Z]+) ([a-zA-Z]+)\.txt$'
match = re.match(pattern, file_name)
if not match:
print(f'Invalid file name format: {file_name}')
return table_name
# Extract the first and last name from the file name
first_name = match.group(1)
last_name = match.group(2)
# Generate the output string
table_name = f'{first_name.lower()}_{last_name.lower()}_all_messages'
return table_name
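A couple of quick examples of how this behaves:
print(process_chat_file_name('WhatsApp Chat with Steve Jobs.txt'))  # steve_jobs_all_messages
print(process_chat_file_name('random-file.txt'))                    # '' (prints an invalid format warning)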
Create a Cloud Function
The automation doesn’t stop there! I wanted to simply be able to add a newly exported file to a Google Cloud Storage bucket and have it run all of the Python logic described so far. For this, I created a Cloud Function which triggers when new files are added to a bucket.
This would be the function that is first called for this event:
import os
import re
import tempfile
from google.cloud import bigquery, storage
from whatsapp_parser import WhatsAppParser
def process_whatsapp_file(data, context):
"""Triggered by a change to a Cloud Storage bucket.
Args:
data (dict): The Cloud Functions event payload.
context (google.cloud.functions.Context): Metadata of triggering event.
"""
# Extract bucket and file details from event message
bucket_name = data['bucket']
file_name = data['name']
table_name = process_chat_file_name(file_name)
if table_name == '':
return
# Create a Cloud Storage client object
storage_client = storage.Client()
# Download file to local temp file
_, temp_local_filename = tempfile.mkstemp()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(file_name)
blob.download_to_filename(temp_local_filename)
# Parse file with WhatsAppParser
messages_df = WhatsAppParser(temp_local_filename).to_dataframe()
# Upload to BigQuery
upload_to_bigquery(messages_df, table_name)
# Delete local temp file
os.remove(temp_local_filename)
The above script first imports the necessary modules: os, re, tempfile, google.cloud.bigquery, google.cloud.storage, and a custom module called whatsapp_parser, which contains the WhatsAppParser class detailed in the first section. The function process_whatsapp_file is what gets executed when the Cloud Storage bucket changes. It takes two arguments: data (a dictionary representing the event payload) and context (metadata of the triggering event).
The function extracts relevant details from the event message, such as the bucket name, file name, and infers a table name based on the chat file name. It then creates a client object for Google Cloud Storage, which is used to download the file from the Cloud Storage bucket to a temporary local file using a generated unique filename.
The WhatsAppParser class is then used to parse the downloaded file and convert it into a DataFrame object named messages_df, which is uploaded to BigQuery using the upload_to_bigquery function detailed earlier. After the DataFrame has been successfully uploaded, the script deletes the local temporary file to clean up disk space. Cloud Functions require a requirements.txt file to know which additional packages to install, which can be found along with the complete Cloud Function in my GitHub.
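For reference, the requirements.txt needs at least the packages used above; something along these lines should work (pyarrow is required by the BigQuery client for load_table_from_dataframe):
google-cloud-bigquery
google-cloud-storage
pandas
pyarrow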
Some Basic Metrics
So I finally have the WhatsApp data in BigQuery after parsing the raw exported file and uploading it with Python. I can now query this data and answer some basic questions one might have about their chat logs (note the placeholder for the BigQuery Table ID):
SELECT
author,
COUNT(content) AS total_messages,
SUM(ARRAY_LENGTH(SPLIT(content, ' '))) AS total_words,
ROUND(AVG(ARRAY_LENGTH(SPLIT(content, ' '))),4) AS average_words_per_message,
MAX(ARRAY_LENGTH(SPLIT(content, ' '))) AS most_words_in_single_message,
MAX(LENGTH(content)) AS longest_single_message
FROM
`[YOUR-GCP-PROJECT-ID].[YOUR-DATASET-ID].[YOUR-TABLE-ID]`
GROUP BY
1
I’ll assume you already have an understanding of SQL so will just explain the columns returned by this query:
- author: It selects the author column from the table, which is then used in the GROUP BY clause so that the calculations in the SELECT clause are performed separately for each unique author in the table.
- COUNT(content) AS total_messages: It calculates the total number of messages for each author.
- SUM(ARRAY_LENGTH(SPLIT(content, ' '))) AS total_words: It calculates the total number of words across all messages for each author. It splits the content column by spaces, counts the number of resulting elements (words), and sums them up.
- ROUND(AVG(ARRAY_LENGTH(SPLIT(content, ' '))),4) AS average_words_per_message: It calculates the average number of words per message for each author. It follows the same logic as above but uses the AVG function to calculate the average and rounds it to four decimal places.
- MAX(ARRAY_LENGTH(SPLIT(content, ' '))) AS most_words_in_single_message: It calculates the maximum number of words in a single message for each author. It again splits the content column by spaces, counts the words in each message, and selects the maximum value.
- MAX(LENGTH(content)) AS longest_single_message: It calculates the length of the longest single message (in terms of characters) for each author.
How about emoji usage? Here’s a query to find the frequency of each emoji used (note the placeholders for the author values in the table and the BigQuery Table ID):
SELECT
emoji,
COUNT(*) AS frequency,
COUNT(CASE WHEN author = "[AUTHOR ONE]" THEN emoji END) AS author_one_frequency,
COUNT(CASE WHEN author = "[AUTHOR TWO]" THEN emoji END) AS author_two_frequency
FROM (
SELECT
author,
REGEXP_EXTRACT_ALL(content, r'[\x{1F600}-\x{1F64F}]|[\x{2700}-\x{27BF}]|[\x{1F900}-\x{1F9FF}]|[\x{1F680}-\x{1F6FF}]|[\x{1F1E0}-\x{1F1FF}]|[:][\)\(PpoO\\\/DdSs]|<3') AS emojis
FROM
`[YOUR-GCP-PROJECT-ID].[YOUR-DATASET-ID].[YOUR-TABLE-ID]`)
CROSS JOIN
UNNEST(emojis) AS emoji
GROUP BY
emoji
ORDER BY
2 DESC
As this query has a few different parts to it, here’s a more detailed explanation on what is happening:
- The query starts with the SELECT clause, which specifies the columns to be included in the result:
  - emoji: It selects the emoji column.
  - COUNT(*) AS frequency: It calculates the frequency of each emoji occurrence.
  - COUNT(CASE WHEN author = "[AUTHOR ONE]" THEN emoji END) AS author_one_frequency: It calculates the frequency of each emoji occurrence specifically for [AUTHOR ONE].
  - COUNT(CASE WHEN author = "[AUTHOR TWO]" THEN emoji END) AS author_two_frequency: It calculates the frequency of each emoji occurrence specifically for [AUTHOR TWO].
- The FROM clause contains a subquery. The subquery selects the author column and extracts all occurrences of emojis from the content column using a regular expression pattern. The extracted emojis are stored in an array called emojis.
- The outer query performs a CROSS JOIN with the UNNEST function on the emojis array, which essentially flattens the array into individual rows. This creates a row for each occurrence of an emoji.
- The GROUP BY clause groups the result set by the emoji column, so all occurrences of the same emoji are grouped together.
- The ORDER BY clause sorts the result set in descending order based on the second column (frequency) to show the emojis with the highest frequency first.
In summary, the query retrieves information about the frequency of emojis in a specified table. It calculates the overall frequency and frequency for specific authors, sorts the emojis based on frequency in descending order, and presents the result.