Implementing further filtering
This is part 2 of my series; the first part can be read here: Leveraging Social Media for Operational Threat Intelligence
Now that you can connect to Twitter and pull down tweets directly, we want to start building the functionality that identifies which of those tweets are most relevant to us. Keep in mind that this is actually a second layer of filtering. We set up the first filter when we opened the streamer, which means Twitter only sends us tweets matching the terms or users we provided.
Now we want to see whether any specific status is pertinent to us. There are multiple ways to do this; what we’ll do is look at specific fields within the tweets themselves and see whether any of them holds a value we care about. A full list of the fields associated with a tweet is available on the Twitter developer website.
As you can see, tweets consist of many different fields. For our second tier of filtering, we’re interested in the domains and Twitter account mentions in the tweet. Each of these requires looking at a different field in the status and comparing it against a list of criteria; if the criteria are met, we want some action to occur. The specifics of that action, i.e. what happens when a test comes up true, will be discussed in part 3 of this series, where we’ll roll out a way to send notifications. At this step we’ll develop the logic tests that determine whether or not an action should be taken.
We will contain our logic tests in simple functions whose sole purpose is to return TRUE or FALSE depending on whether the test meets a criterion. Since we’re looking at two distinct tests, we’ll write two functions, which will then be called by the on_status method.
The values to test against will be stored as part of the Listener class, so we will also add functions that handle loading this data. For our super simple example, we’ll assume the data lives in text files that we can simply read from. In your case, you may already have a database containing that information that you could pull from. But since we’re after a simple proof of concept, we’re going with the simplest example.
Loading the values
To see if a value exists in a status, you’ll first need to load it…obviously. In this case we’re assuming that you have the following information available:
- A list of “good guy” Twitter accounts: the accounts associated with your organization and potentially your partners.
  - We’ll say this is stored in a simple text file
- A list of domains you want to monitor
  - We’ll say this is stored in a simple text file
To make our directory a little cleaner, I’ll create a ‘data’ folder to hold all of these files.
We want to load these values when the Listener class starts up, so we’ll add the calls to our loader functions to the __init__ method.
Our two loading functions follow the same process (these could probably be combined into one function, but screw it, I don’t mind the occasional ctrl-c ctrl-v style programming).
The domain loader function loads the domains from the domain.txt file, which will be included in our example files, while the Twitter account loader does the same for the Twitter accounts we want to watch for mentions. These mentioned accounts are different from the bad guy accounts, since we want to identify when they’re being mentioned, not necessarily when they themselves are tweeting.
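As a rough illustration of the loading step, here is a minimal sketch that folds both loaders into one helper. The helper name load_values, the accounts.txt file name, and the use of a set instead of a list are my assumptions for this sketch; the base class that the real Listener inherits from is left out so the snippet runs on its own.

```python
def load_values(path):
    """Read one value per line, trimming whitespace and lowercasing.

    Blank lines are skipped. Lowercasing here lets the later checks
    compare apples to apples.
    """
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


class Listener:
    # Assumption: in the real script this class subclasses the streaming
    # listener set up in part 1. That base class is omitted here.
    def __init__(self,
                 domain_path="data/domain.txt",
                 account_path="data/accounts.txt"):  # accounts.txt is a hypothetical name
        # Load both lists once, when the listener is created.
        self.domains = load_values(domain_path)
        self.accounts = load_values(account_path)
```

With a domain.txt containing one domain per line, Listener().domains ends up holding the lowercased domains, ready for membership tests.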
Next, let’s work through the domain comparison. For our case we’re interested in ONLY the domain, since our list consists of only domains. To compare apples to apples, we need a way to extract only the domain, regardless of the protocol and regardless of which resource (URI) is being requested. This is where we introduce our second dependency, urlparse (it lives in urllib.parse on Python 3). This great library parses a URL into an object from which you can then pull out the individual components of the URL.
The check calls the urlparse function on each URL mentioned in the status object. Once the URL is parsed, we extract just the domain by reading the netloc attribute. After that, we convert it back to a string, make it lowercase (remember, we lowercased our domain list too, so we can compare apples to apples), and lastly ask whether it is “in” our array of domains. This means the results array will contain a list of TRUE or FALSE values, telling us whether each of the mentioned URLs is one of ours. In the end we don’t want to know how many of our domains are mentioned, we just want to know IF any of our domains are found. So lastly, we call the “any” function on the results array to identify whether any of the results in the array are “True”.
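The domain test described above could be sketched like this. The function name has_monitored_domain is hypothetical, and I’m assuming the URLs have already been pulled out of the status as plain strings and that the domain list was lowercased when it was loaded.

```python
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse


def has_monitored_domain(urls, domains):
    """Return True if any URL's network location is in `domains`."""
    # One True/False entry per URL: parse it, keep only the netloc,
    # lowercase it, and test membership in our domain list.
    results = [str(urlparse(url).netloc).lower() in domains for url in urls]
    # We only care IF there was a hit, not how many.
    return any(results)


print(has_monitored_domain(["https://Example.com/login?x=1"], {"example.com"}))
```

Note that netloc is compared as-is, so “www.example.com” will not match “example.com” unless both forms are in your list.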
To check whether our Twitter accounts are in the mentions, we’ll go through essentially the same process: go through the user mentions dictionary, cycle through the user screen names, and see if they’re in our list.
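A sketch of the mention test, under the assumption that the user mentions arrive as a list of dictionaries that each carry a “screen_name” key, and that our account list was lowercased and stored without the leading “@”. The function name is hypothetical.

```python
def mentions_monitored_account(user_mentions, accounts):
    """Return True if any mentioned screen name is in `accounts`."""
    # One True/False entry per mention, compared in lowercase.
    results = [
        mention["screen_name"].lower() in accounts
        for mention in user_mentions
    ]
    return any(results)


sample_mentions = [{"screen_name": "MyOrg"}, {"screen_name": "someone_else"}]
print(mentions_monitored_account(sample_mentions, {"myorg"}))
```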
Lastly, we want to define a super simple action to take if the tests turn up true. In our case, we’ll just print that a hit was found.
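Putting the pieces together, a stripped-down listener might look like the sketch below. It is not the exact class from this series: the streaming base class is omitted so the snippet runs standalone, the method names other than on_status are made up for illustration, and I’m assuming the status exposes an entities dictionary with “urls” and “user_mentions” lists plus a text attribute.

```python
from types import SimpleNamespace
from urllib.parse import urlparse


class Listener:
    # Assumption: the real class subclasses the streaming listener from part 1.
    def __init__(self, domains, accounts):
        self.domains = domains      # lowercased domains to monitor
        self.accounts = accounts    # lowercased screen names to monitor

    def domain_hit(self, status):
        urls = status.entities.get("urls", [])
        return any(
            urlparse(u.get("expanded_url") or "").netloc.lower() in self.domains
            for u in urls
        )

    def mention_hit(self, status):
        mentions = status.entities.get("user_mentions", [])
        return any(m["screen_name"].lower() in self.accounts for m in mentions)

    def is_hit(self, status):
        # A hit on either test is enough.
        return self.domain_hit(status) or self.mention_hit(status)

    def on_status(self, status):
        # Placeholder action; part 3 swaps this for a notification.
        if self.is_hit(status):
            print("Hit found:", status.text)


# Fake status object, only for trying the logic without a live stream.
fake = SimpleNamespace(
    text="check this out",
    entities={"urls": [{"expanded_url": "https://example.com/a"}],
              "user_mentions": []},
)
Listener({"example.com"}, {"myorg"}).on_status(fake)
```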
And there you have it: a way to check tweet statuses against two simple criteria. Feel free to try other ones and build off this simple foundation. Stay tuned for the next part, where we’ll go into detail on how to create notification emails based on the hits.