For the backstory on how this project was birthed, check out this post here.

TL;DR: I really wanted to know how I could best quantify the quality of my text message conversations and trend it over time to detect changing relationship dynamics.

...I hoped (for both my curiosity and my rapidly deteriorating sense of normalcy) that I wasn't the only person in the world who had ever wanted to dump all of their text message conversations into .txt files.

Somebody with equally questionable intentions had built a tool for Macs that finds the chat.db file, which contains all of your conversations, provided that your iPhone is synced with your Mac. I was in luck: as a budding Apple Fanboy, my texts were definitely in that file. I ran the script, and after a few minutes of bash script magic, it was finished.

It's lit: after running the script I had whole conversations on deck as simple .txt files. Lots to work with here.

Here's a quick excerpt showing how they were formatted.

Friend: wow i heard you have a new blog|2018-02-07 23:18:12
Me: yerppp, ye alweddy know boii πŸ’ͺ🏿 |2018-02-07 23:18:16
Me: Im actually boutta write an article about analyzing texts messages πŸ€“|2018-02-07 23:18:18
Friend: shidd, that sounds incredible |2018-02-07 23:19:52
Friend: can i pleeaaaseeeeee be featured?! πŸ™πŸΏ|2018-02-07 23:19:58
Me: Eh, I'll see what i can do, but no promises bruv|2018-02-07 23:20:56
Friend: thanks, i owe you!!|2018-02-07 23:21:17
Me: Be blessed beloved|2018-02-07 23:21:24

So the important things at first glance:

  1. Streets are hungry for the blog, no doubt.
  2. Date time format after every message: YYYY-MM-DD HH:MM:SS.
  3. Emojis are rendering, but what will they look like when they're actually read from a file? Some weird unicode/ascii black magic bullshit, undoubtedly.
  4. Sender of each message is identified before the first colon.

So the date format on every message needs some date parsing, no problem there. Python has a few good libraries that make this a one-liner; I chose my favorite, dateutil.parser. For emoji rendering we'll need some clever lookup functions for determining emojis from their raw unicode representation.
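For example, parsing one of the timestamps from the excerpt above really is a one-liner:

from dateutil import parser as date_parser

# '2018-02-07 23:18:12' -> datetime.datetime(2018, 2, 7, 23, 18, 12)
timestamp = date_parser.parse('2018-02-07 23:18:12')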

After a few test runs I realized something worrisome. Some people send hella texts only a few seconds apart, which would skew all of my counting metrics for people with this particular texting habit. To handle this, I use a TextEquivalent class that represents multiple rapid-fire texts as a single entity, providing some standardization across different texting styles.

Text Equivalent Class
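The full class lives in the repo; here's a minimal sketch of the idea. The attribute names (sender, texts, timestamp) are assumed for illustration, and only all_text and merge_sequential_text_equiv() come from the actual code.

from dateutil import parser as date_parser

class TextEquivalent:
    """One or more rapid-fire texts from the same sender, treated as a single unit."""

    def __init__(self, sender, text, timestamp_str):
        self.sender = sender
        self.texts = [text]
        # Timestamp of the first text in the burst, parsed at instantiation
        self.timestamp = date_parser.parse(timestamp_str)

    @property
    def all_text(self):
        # Everything in this burst joined together, for regex searches like te.all_text.lower()
        return ' '.join(self.texts)

    def merge_sequential_text_equiv(self, other):
        # Fold a rapid-fire follow-up from the same sender into this entity
        self.texts.extend(other.texts)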

Next up, actually processing the giant text file that our handy open source bash script dumped.

This step is mostly regular expressions and lots of string slicing to get rid of special characters and whatnot. The most interesting part is deciding when to call the merge_sequential_text_equiv() method from the TextEquivalent class described earlier, based on the time difference between two texts.

Reading in the raw text and creating a series of TextEquivalent objects.
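Roughly, it looks something like this sketch (reusing the TextEquivalent sketch from above). The line pattern and the 60-second merge cutoff are my assumptions here; the real parsing and merge criteria live in the repo.

import re

# sender before the first colon, the message text, then |YYYY-MM-DD HH:MM:SS
LINE_PATTERN = re.compile(r'^(?P<sender>[^:]+): (?P<text>.*)\|(?P<timestamp>[\d\- :]+)$')
MERGE_THRESHOLD_SECONDS = 60  # assumed cutoff for "rapid fire" texts

def parse_conversation(path):
    text_equivalents = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            match = LINE_PATTERN.match(line.strip())
            if not match:
                continue
            te = TextEquivalent(match.group('sender'), match.group('text'),
                                match.group('timestamp'))
            previous = text_equivalents[-1] if text_equivalents else None
            if (previous is not None and previous.sender == te.sender
                    and (te.timestamp - previous.timestamp).total_seconds() < MERGE_THRESHOLD_SECONDS):
                previous.merge_sequential_text_equiv(te)
            else:
                text_equivalents.append(te)
    return text_equivalents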

Alright cool, next up was actually analyzing the content of the text. The big fun part.

πŸ”₯πŸ€”Burning QuestionsπŸ€”πŸ”₯
  1. Who was double texting more? i.e., starting the conversations after being the last person to speak in the preceding one.
  2. How long did each person keep the other waiting for a response?
  3. Who was laughing more often?
  4. Cursing more often?
  5. Sending and sharing more links and articles?
  6. What about emojis? Who sends them more and which ones were most often used?
  7. What are my favorite curse words?
  8. What's my favorite way to express laughter?

There's a ton of repetitive code in the repository that you're more than welcome to look at, but in general most of these questions can be answered with regular expressions and by keeping track of the timestamps associated with each text; luckily, the TextEquivalent class was designed to make that pretty simple.

Finding Links

I combined a few answers from Stack Overflow and various forums and came out with this ghastly expression.

re.search(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',te.all_text.lower())

Finding Emojis

Okay, this was really terrible, awful, and generally unenjoyable. Apparently in Python 3+ the regular expression for finding emojis is:

re.findall(r'[\U0001d300-\U0001d356]',te.all_text.lower())

In Python 2 however... behold, lucifer himself.

re.search(ur'(\ud838[\udc50-\udfff])|([\ud839-\ud83d][\udc00-\udfff])|(\ud83e[\udc00-\udfbf])|([\udc50-\udfff]\ud838)|([\udc00-\udfff][\ud839-\ud83d])|([\udc00-\udfbf]\ud83e)',te.all_text.lower())

There was also a fair amount of acrobatics needed to determine when emojis with skin tones were included. I went through a lot of extra trouble to figure this one out because representation.

The tricky part about detecting skin tones is that in unicode a toned emoji is represented as a concatenation of the regular emoji and the skin tone modifier. But the regular expression above detects any type of emoji, so after detecting an emoji you have to check whether a skin tone code immediately follows it.

Skin tone emoji unicode values:

u'\U0001f3fb',u'\U0001f3fc',u'\U0001f3fd',u'\U0001f3fe',u'\U0001f3ff'
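A sketch of that check pairs each matched emoji with the modifier that may immediately follow it (the base emoji range below is a simplification of the expressions above):

import re

SKIN_TONES = {u'\U0001f3fb': 'light', u'\U0001f3fc': 'medium-light',
              u'\U0001f3fd': 'medium', u'\U0001f3fe': 'medium-dark',
              u'\U0001f3ff': 'dark'}

# A base emoji followed by an optional skin tone modifier (simplified Python 3 ranges)
EMOJI_WITH_TONE = re.compile(u'([\U0001f300-\U0001faff])([\U0001f3fb-\U0001f3ff]?)')

def find_emojis_with_tones(text):
    # Returns (base_emoji, skin_tone_name_or_None) pairs, e.g. 'πŸ’ͺ🏿' -> [('πŸ’ͺ', 'dark')]
    return [(base, SKIN_TONES.get(tone)) for base, tone in EMOJI_WITH_TONE.findall(text)]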

For anyone dying to know more about emojis, I have another post that goes into some more depth about how they work and some of their fun quirks including why emoji skin tones are the way they are!

After finding the unicode representations of the emojis, I scraped the official Unicode page to match the code points up with human-readable names such as "grinning face with sweat" for πŸ˜… and "woman technologist: dark skin tone" for πŸ‘©πŸΏβ€πŸ’».
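I won't reproduce the scraper here, but the end result is just a lookup table from code point sequences to names, something like the stand-in below (the two entries are the examples from above, and unicodedata is a fallback for anything that didn't get scraped):

import unicodedata

# Tiny stand-in for the scraped table; the real one covers every emoji,
# including the skin tone sequences.
EMOJI_NAMES = {
    u'\U0001f605': 'grinning face with sweat',
    u'\U0001f469\U0001f3ff\u200d\U0001f4bb': 'woman technologist: dark skin tone',
}

def emoji_name(emoji):
    # Fall back to the formal Unicode name of the first code point if it wasn't scraped
    return EMOJI_NAMES.get(emoji) or unicodedata.name(emoji[0], 'unknown emoji').lower()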

Finding Curse Words

I like to think I am a more prolific/creative curser than this regular expression lets on, but I may have been getting lazy. It's pretty self-explanatory what's going on here.

re.search(r'\b([s]+[h]+[i]+[t]+|[f]+[u]+[c]+[k]+|[b]+[i]+[t]+[c]+[h]+)\b',te.all_text.lower())

Finding Laughter

This one catches a lot of different variations in the way that people laugh, or pretend to laugh, via text. You may be able to figure out which ones just by visually inspecting the regular expression, but I'll list some here.

  • lol
  • lmao
  • haha
  • lololol
  • aahaahaa
  • lmaoooo (i know you got that one friend who is somehow always laughing their ass off off off off)

re.search(r'\b(a*ha+h[ha]*|o?l+o+l+[ol]*|lma[o]+)\b',te.all_text.lower())

Regular Expression Playground & Proof of Concept

If you want to try any of the regular expressions explained above against some words you might find in your own messages, give it a whirl here by pressing the green play button.
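If you'd rather poke at them locally, a tiny script like this (with made-up sample texts) exercises the link, curse, and laughter expressions:

import re

link_re = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
curse_re = re.compile(r'\b([s]+[h]+[i]+[t]+|[f]+[u]+[c]+[k]+|[b]+[i]+[t]+[c]+[h]+)\b')
laugh_re = re.compile(r'\b(a*ha+h[ha]*|o?l+o+l+[ol]*|lma[o]+)\b')

samples = [
    'lmaoooo send me that article https://example.com/some-post',
    'shiiit that was hilarious hahahaha',
]

for text in samples:
    lowered = text.lower()
    print(link_re.findall(lowered), curse_re.findall(lowered), laugh_re.findall(lowered))
# ['https://example.com/some-post'] [] ['lmaoooo']
# [] ['shiiit'] ['hahahaha']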

Double Texts

Calculating the time between texts is a simple subtraction of the timestamps parsed during TextEquivalent instantiation. Determining whether a double text has occurred takes a few steps:

  1. Calculate the median response time across all the texts from a given sender.
  2. Check if two consecutive texts are from the same person.
  3. If they are, check that the time difference between the two is at least 3.5 times the median wait time calculated in step 1.

If the criteria in steps 2 and 3 are both met, then a double text is logged.
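A rough sketch of that logic, assuming the TextEquivalent objects are already in chronological order (the 3.5x factor comes from step 3; the exact median calculation is approximated here):

from statistics import median

DOUBLE_TEXT_FACTOR = 3.5

def find_double_texts(text_equivalents, sender):
    pairs = list(zip(text_equivalents, text_equivalents[1:]))
    # Step 1: median gap preceding each of this sender's texts
    gaps = [(curr.timestamp - prev.timestamp).total_seconds()
            for prev, curr in pairs if curr.sender == sender]
    if not gaps:
        return []
    median_gap = median(gaps)
    # Steps 2 and 3: same sender twice in a row, with a gap well above the median
    return [curr for prev, curr in pairs
            if prev.sender == curr.sender == sender
            and (curr.timestamp - prev.timestamp).total_seconds() >= DOUBLE_TEXT_FACTOR * median_gap]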

Top X Occurrences

One of the other features included in the program is determining the most common occurrences of certain things: for instance, the top 10 emojis used, the top 5 curse words, or the top 3 laughter types. As shown in the double texting section, everything is processed as a list of TextEquivalent objects, and at the end of processing each metric returns a list of dictionaries in the following format.

curses = [{'curse_used': ['shit'], 'curse_bool': True, 'day of week': 5, 'hour': 23}, {'curse_bool': False, 'day of week': 5, 'hour': 23}, {'curse_bool': False, 'day of week': 6, 'hour': 2}, {'curse_bool': False, 'day of week': 6, 'hour': 3}]

links = [{'day of week': 4, 'link_bool': False, 'hour': 3}, {'day of week': 4, 'link_bool': False, 'hour': 3}, {'day of week': 4, 'link_bool': False, 'hour': 3}, {'link_used': ['http://verysmartbrothas.com/the-10-most-dangerous-types-of-supposedly-good-white-people/'], 'day of week': 4, 'link_bool': True, 'hour': 3}]

laughs = [{'laugh_bool': True, 'laugh_used': ['hahahaha'], 'day of week': 4, 'hour': 1}, {'laugh_bool': False, 'day of week': 4, 'hour': 1}, {'laugh_bool': False, 'day of week': 4, 'hour': 1}, {'laugh_bool': True, 'laugh_used': ['lol'], 'day of week': 4, 'hour': 1}]

In addition to the booleans and the actual regex text match, the result of each process also stores the day of the week, and the hour of the day of the text message. This extra information is used in analysis and allows us to break down text characteristics both by the time of day and the day of week.

To characterize the top x occurrences for any of these metrics, the following logic was implemented.
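A collections.Counter handles this nicely; something along these lines (the function name here is mine, not the repo's):

from collections import Counter

def top_x(records, key, x):
    # Tally every regex match stored under `key` (e.g. 'curse_used', 'laugh_used')
    # and return the x most common ones with their counts
    counts = Counter()
    for record in records:
        counts.update(record.get(key, []))
    return counts.most_common(x)

# top_x(laughs, 'laugh_used', 3) -> [('hahahaha', 1), ('lol', 1)]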

Visualization

For visualization, all the plots were made with plotly, my favorite graphing library of all time (and it's not close). It has a bit of a learning curve, but in my opinion it makes graphs really pretty and takes matplotlib's lunch money.

The types of visualizations of the most interest to me were trends over time and patterns broken down by day of the week and hour of the day. Given the data structures shown in the previous sections, it wasn't too much of a leap to build the plots. I'll make note of the few interesting parts:

  1. Time zones and date time formats: before I used plotly, I used matplotlib, and it handles dates differently than most. Instead of unix time, which most people are used to (the number of seconds since midnight UTC on 1970-01-01), matplotlib uses the number of days since midnight UTC on 0001-01-01. Just annoying. Even with plotly, all of the timestamps had to be changed from timezone-naive to timezone-aware date objects (see the sketch after this list). I'm done talking about this actually, I'm getting upset again.
  2. Subsets of texts: Because the TextEquivalent object stored metadata about the text messages such as timestamps, it wasn't terribly complicated to write functions to segment texts by time. For instance doing visualizations only for texts that happened on Thursdays or during the 22nd hour of the day or in the month of December or between March 4th and May 3rd.
  3. Redundant code: I unfortunately was not thinking very hard about architecture and design when I started this project and as a result some of the plotting is a bunch of copying and pasting of strings with various substitutions.
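For what it's worth, the naive-to-aware conversion from point 1 is a one-liner per timestamp; a minimal sketch, assuming UTC is close enough:

from datetime import timezone

def make_aware(naive_dt, tz=timezone.utc):
    # Tag a timezone-naive datetime so plotly gets an unambiguous instant;
    # the real code might localize to the phone's local zone instead of UTC
    return naive_dt.replace(tzinfo=tz)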

Here are some visualizations reproduced from the backstory post.

Source Code Available Here

Thanks for getting through this post; if you have any questions or suggestions, scroll down and hollr at the kid.

Best,

πŸ‘¨πŸΏβ€πŸ’»