OSINT Data Sources: Trust but Verify

[Image: Software engineer analyzing source code with papers]

Estimated Reading Time: 12 min

Thanks to @seamustuohy and @ginsberg5150 for editorial contributions

For new readers, welcome, and please take a moment to read a brief message From the Author.  This article’s primary audience is analysts; however, if you are in leadership and seek to optimize or maximize the analysis your threat intelligence program is producing, you may find this walkthrough valuable.

OVERVIEW

In the recent blog Outlining a Threat Intel Program (https://www.osint.fail/2017/04/24/outlining-a-threat-intel-program/), steps 4 and 5 discussed identifying information needs/requirements and resources.  This blog expands on the latter: identifying the resources.  Previously we discussed that when identifying resources, you start with what you have inside your organization, and that sometimes the information is in a source to which you do not have access or is sold by a vendor.  There is another, obvious category not previously mentioned: open source, freely available (usually on the Internet).  Free is good. Free is affordable. Free can be dangerous.  Before you decide to trust and use a free source, you need to adequately vet it.  Below I walk through a recent project: a free source that was considered, how it was vetted, and the takeaways, which are 1) ensure the accuracy and completeness of OSINT sources before they are considered reliable or relevant; 2) know the limitations of your OSINT-sourced data; and 3) thoroughly understand any filtering and calculations that occur before source data is provided to you.

GETTING THE RAW DATA AND CONVERTING IT FOR ANALYSIS

Raw data is on GitHub here: https://github.com/grcninja/OSINT_blog_data, file name 20170719_OSINT_Data_Sources_Trust_but_Verify_Dyn_Original_Data.7z

Over the years I have worked on a few projects with ties to Ukraine.  During a conversation with colleagues, we came up with a question we wanted to answer:  Is there any early warning indicator of a cyb3r attack on a power grid that might be found in network outages?  We reasoned that even though businesses, data centers, and major communications companies have very robust UPS/generator capacity available, an extended power outage would surely tap them and thereby cripple communications, which could have a ripple effect during an armed conflict, biological epidemic, civil unrest, etc.  So, I decided to see what data was out there with respect to network outages and then collect data related to power outages.  I settled on my first source, Dyn’s Internet Outage bulletins from http://b2b.renesys.com/eventsbulletin (filtered on Ukraine).  According to their site, the organization “helps companies monitor, control, and optimize online infrastructure for an exceptional end-user experience. Through … unrivaled, objective intelligence into Internet conditions… “

Weighing the effort to script this against the time it would take to copy/paste the few pages of data I needed, I opted to do this old school and copy the data straight off the web pages.  If this turns out to be something I will do more than two or three times, then I will consider asking for API access and automating the task. For the time being, though, this was a one-off, I-am-curious-and-bored investigation, and manual labor settled it.  I scraped the approximately 480 entries available at the time, going back to the first entries in March 2012 (there are more now, of course).

I used a mix of regex-fu and normal copy/paste to morph the data into semicolon-delimited lines.  I chose semicolons because the data already contained commas, spaces, and dashes that I was not sure I wanted to touch, and since I had fewer than 2 million rows of data and I love MS Office, semicolons let me manipulate it in MS Excel more easily.

Below is an example of a typical entry:

6 networks were restored in the Ukraine starting at 16:03 UTC on April 11. This represents less than 1% of the routed networks in the country.

100% of the networks in this event reached the Internet through: UARNet (AS3255).

Let the metamorphosis begin:

replace -> networks were restored in the Ukraine starting at ->

6;16:03 UTC on April 11. This represents less than 1% of the routed networks in the country.

100% of the networks in this event reached the Internet through: UARNet (AS3255).

delete ->  on

6;16:03 UTC April 11. This represents less than 1% of the routed networks in the country.

100% of the networks in this event reached the Internet through: UARNet (AS3255).

replace -> . This represents -> YYYY; [I saved each year’s data as separate text files to make this search/replace easier later so that I can do one year per file]

6;16:03 UTC April 11 YYYY; less than 1% of the routed networks in the country.

100% of the networks in this event reached the Internet through: UARNet (AS3255).

replace -> less than -> < [the less-than symbol]

6;16:03 UTC April 11 YYYY; < 1% of the routed networks in the country.

100% of the networks in this event reached the Internet through: UARNet (AS3255).

replace ->  of the routed networks in the country. (including the newline characters) -> ;

6;16:03 UTC April 11 YYYY; < 1% ; 100% of the networks in this event reached the Internet through: UARNet (AS3255).

replace ->  of the networks in this event reached the Internet through:  -> ;

6;16:03 UTC April 11 YYYY; < 1% ; 100;UARNet (AS3255).

replace -> (AS -> ;AS

6;16:03 UTC April 11 YYYY; < 1% ; 100;UARNet;3255).

delete -> ).

6;16:03 UTC April 11 YYYY; < 1% ; 100;UARNet;3255

At this point, each of the individual original files contains mostly lines that look like above with the added caveat that I replaced YYYY with the appropriate year for each file.
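For readers who would rather script it after all, the whole replace chain collapses into one regular expression.  This is a minimal sketch of my manual process, not the code I used; the field names and the exact pattern are my own, built from the sample entry above, and real bulletins with other wordings would need more patterns.

```python
import re

# One pattern covering the sample wording above; named groups become the
# semicolon-delimited fields (count;timestamp;pct;reach;AS name;ASN).
ENTRY_RE = re.compile(
    r"(?P<count>\d+) networks (?:were restored|experienced (?:a momentary|an) outage)"
    r" in the Ukraine starting at (?P<time>\d{1,2}:\d{2}) UTC on (?P<date>\w+ \d{1,2})\."
    r" This represents (?P<pct>(?:less than )?\d+%) of the routed networks in the country\."
    r"\s*(?P<reach>\d+)% of the networks in this event reached the Internet through: "
    r"(?P<asname>.+?) \(AS(?P<asn>\d+)\)\."
)

def parse_entry(raw: str, year: str) -> str:
    """Convert one raw bulletin entry into a semicolon-delimited row."""
    m = ENTRY_RE.search(raw)
    if m is None:
        raise ValueError("entry does not match the expected wording")
    pct = m.group("pct").replace("less than ", "< ")  # 'less than 1%' -> '< 1%'
    return ";".join([
        m.group("count"),
        f"{m.group('time')} UTC {m.group('date')} {year}",
        pct,
        m.group("reach"),
        m.group("asname"),
        m.group("asn"),
    ])

raw = ("6 networks were restored in the Ukraine starting at 16:03 UTC on April 11. "
       "This represents less than 1% of the routed networks in the country.\n"
       "100% of the networks in this event reached the Internet through: UARNet (AS3255).")
print(parse_entry(raw, "2017"))
# 6;16:03 UTC April 11 2017;< 1%;100;UARNet;3255
```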

And like any good analyst I began to scroooooooolllllllllll through over 400 lines, hoping my eyes would not catch anything that wasn’t uniform.  Of course, I wasn’t that lucky; instead I found entries that were ALMOST worded the same, but just different enough to make me wonder whether this “reporting” was automated or human.

  • 59 networks experienced an outage…
  • 30 networks experienced a momentary outage…
  • 6 networks were restored…

Do you see it?  I now had three kinds of entries: “restored,” which implies that there was an outage; “outage”; and “momentary outage.”  I performed a few more rounds of semicolon replacement so that I could keep moving, and noted the issues.
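Before settling on replacement rules, it is worth auditing how many wording variants the feed actually uses.  A minimal sketch (the sample lines are abbreviated versions of the entries quoted above, and `entry_kind` is my own helper, not anything from the source):

```python
from collections import Counter

def entry_kind(line: str) -> str:
    """Classify a raw bulletin line by its wording variant."""
    if "momentary outage" in line:
        return "momentary outage"
    if "experienced an outage" in line:
        return "outage"
    if "were restored" in line:
        return "restored"
    return "unknown"

lines = [
    "59 networks experienced an outage in the Ukraine starting at ...",
    "30 networks experienced a momentary outage in the Ukraine starting at ...",
    "6 networks were restored in the Ukraine starting at ...",
]
print(Counter(entry_kind(l) for l in lines))
```

Running a tally like this over the full scrape, rather than eyeballing 400+ lines, is what would have surfaced the three variants immediately.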

DATA QUALITY AND ANOMALIES

I noted the inconsistencies in the entries, but I wasn’t ready to write off the source.  After all, Dyn is a large, global, reputable firm, and I believed there could be some reasonable explanation, so I continued to prepare the data for analysis.

The statement “This represents less than 1%” isn’t an exact number, so I reduced it to “;< 1%” to get it into my spreadsheet for the time being. I wanted to perform some trend analysis on the impact of these outages and planned to use the ‘percentage of networks affected’ value.  To do this, I was going to need a number, and converting the percentages seemed like it should be relatively easy.  I considered that new networks are probably added regularly (routable networks not very often) and decided to estimate the number of networks that existed just prior to any reported outage/restoration based on a known number.  So, I turned to the statements that were more precise.  Unfortunately, to my surprise, there were some (more) serious discrepancies.

For example, on 5/8/2012 there were two entries, one for 14:18 UTC and the other for 11:39 UTC.  The first stated that 97 networks were affected, which represented 1% of the routed networks in the country (not less than 1%, exactly 1%), and the math says 97/.01 = 9,700 networks. The second stated that 345 networks were affected, which was 5% of the routed networks, and the math says 345/.05 = 6,900 networks.  How could this be?  In less than three hours, did someone stand up 2,800 routable networks in Ukraine?  I don’t think so.  I performed the same calculations on other exact percentages for entries close in time, most of them on the same day, and found the same problem in the data.  This is no small discrepancy; I can only assume that the percentages reflected on the web page have undergone some serious rounding or truncation, so I decided to remove this column from the data set and not perform any analysis related to it.  Instead I kept the value that, for the time being, I trusted: the exact number of networks affected.

In the data set, there were 10 entries that did not list a percentage of networks able to reach the Internet either during the outage or after the restoration (the number of networks affected ranged from 7 to 105); these entries also lacked data for the affected ASN(s).  To avoid skewing the analysis, these incomplete entries were removed from the data set.  There was one entry of 23 affected networks on 2/17/2014 where the affected ASN was not provided but all other information was available, and a value of UNK (unknown) was entered for the ASN.

Additional name standardization was completed where the AS number (ASN) was the same but the AS names varied slightly, such as LLC FTICOM vs. LLC “FTICOM”.
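One way to sketch that standardization: strip quote characters and stray whitespace, then group the spellings by ASN so each ASN keeps a single canonical name.  The ASN below is a placeholder, not FTICOM’s real number, and `canon` is my own helper:

```python
import re
from collections import defaultdict

def canon(name: str) -> str:
    """Drop straight and curly quotes, trim whitespace."""
    return re.sub(r"[\"\u201c\u201d]", "", name).strip()

# (ASN, AS name) pairs; 12345 is a placeholder ASN for illustration.
records = [(12345, "LLC FTICOM"), (12345, 'LLC "FTICOM"')]

names_by_asn = defaultdict(set)
for asn, name in records:
    names_by_asn[asn].add(canon(name))

print(dict(names_by_asn))  # {12345: {'LLC FTICOM'}}
```

Any ASN whose set still holds more than one spelling after this pass needs a manual look.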

The final data set was 97.8947% of the original, covering March 22, 2012 16:55 UTC to July 5, 2017 21:40 UTC.

Let’s review the list of concerns/modifications with the data set:

  1.  inconsistent reporting language
  2.  lack of corresponding outage and restoration entries
  3.  inconsistent calculations for the percentage of routable networks affected
  4.  incomplete entries in the data set
  5.  AS names varied while the AS number remained the same

THE ANALYSIS TO PREPARE FOR ANALYSIS

First, I started with the low hanging fruit.

  • Smallest number of networks affected: 6
  • Largest number of networks for one ASN affected: 651 (happened on 5/18/2013)
  • Average number of networks affected: 74.93534483
  • How many times was more than one ASN affected? 253, which is slightly more than half of the total events.
  • Were there any outages of the same size for the same provider that were recurring? YES
    • ASN 31148, Freenet (O3), had three identical outages of 332 networks 10/2/2014, 1/8/2015, 11/25/2015
    • ASN 31148, Freenet (O3), had four identical outages of 331 networks 8/11/2014, 10/6/2014, 11/5/2015, & 8/3/2016
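The simple stats above fall straight out of the cleaned rows; the field layout follows the earlier transformation (count;timestamp;pct;reach;AS name;ASN).  A minimal sketch with two illustrative rows, not the full 400+ line set:

```python
# Two illustrative semicolon-delimited rows; the affected-network count is
# always the first field.
rows = [
    "6;16:03 UTC April 11 2012;< 1%;100;UARNet;3255",
    "651;... (other fields elided)",
]

counts = [int(r.split(";")[0]) for r in rows]
print(min(counts), max(counts), sum(counts) / len(counts))  # 6 651 328.5
```

Run over the real file, the same three expressions give the smallest outage (6), the largest (651), and the average (~74.94).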

At a glance, the low-hanging fruit looked like what was expected, but I still wasn’t sold that this data set was all it was cracked up to be, so I decided to check for expected knowns.

The Ukraine power grid was hit on Dec 23, 2015, with power outage windows of 1-6 hours; however, the only network outages reflected in the data for Dec 2015 are on 1, 2, 8, 27, 30 & 31 December. The next known major outage was more recent, just before midnight on 17 December 2016, and again there is not a single network outage recorded in the data set.  This puzzled me, so I began to look for other network outages that one would reasonably expect to occur.  In India there were power outages on 30 & 31 July 2012 (https://www.theguardian.com/world/2012/jul/31/india-blackout-electricity-power-cuts); according to The Guardian, “Power cuts plunge 20 of India’s 28 states into darkness as energy suppliers fail to meet growing demand,” and if you follow the article, one of many, you get reports that power was out for up to 14 hours.  [I encourage readers to check sources; you can find 41 citations on this wiki page: https://en.wikipedia.org/wiki/2012_India_blackouts]  Then I went back to the data source, filtered on India, and magically there were no reported network outages that paralleled the power outages.  Then I decided to start spot-checking USA outages listed here https://en.wikipedia.org/wiki/List_of_major_power_outages because this list has the following criteria:

  1.  The outage must not be planned by the service provider.
  2.  The outage must affect at least 1,000 people and last at least one hour.
  3.  There must be at least 1,000,000 person-hours of disruption.

And sadly, I chose three different 2017 power outage events, none of which seemed to be reflected in the data set of network outages.  It was time to decide whether the data set I wanted to use was going to meet my needs.  After reviewing what I had learned from normalizing and analyzing the data, I summarized the concerns:

  1.  inconsistent reporting language
  2.  lack of corresponding outage and restoration entries
  3.  inconsistent calculations for the percentage of routable networks affected
  4.  incomplete entries in the data set
  5.  AS names varied while the AS number remained the same
  6.  known events aren’t found in the data set

All things considered, I opted not to use this OSINT data source for my analysis as there were too many discrepancies, especially the lack of “known” events being reflected.

THE LESSON & CONCLUSION

Because this was a pseudo-personal endeavor that only paralleled some of my previous work, I took a very loose approach to the analysis and did not verify that the data set I was considering met my needs for completeness and accuracy.  I did, however, assume that because the data was being made public by a large, reputable, global company, they were exerting some level of quality control in the reporting and posting of the data.  That assumption, of course, was flawed.  Currently, I am unable to come up with an explanation for so many data points being absent from the data set.  It is possible that this major global Internet backbone company simply has no sensors in the affected 71% of India, or has them only in the 29% that wasn’t affected.  Perhaps all their sensors in Ukraine were down on the exact days of the power outages there.  It is also very possible that there were enough UPSs and generators in place in Ukraine and India that, despite the power going out, the networks never went down.  At this point, I’ve decided to search for answers elsewhere and am not spending time trying to figure out where the data went, as this was just a personal project.

TAKEAWAYS

Data sources need to be vetted for accuracy and completeness before they are considered reliable or relevant.  When considering a source, check it for known events that should be reflected.  If you do not find these events and you still wish to use the source, make sure you understand WHY the events are not reflected.  Knowing the limitations of your OSINT-sourced data is critical, and thoroughly understanding any filtering and calculations that occur before it is provided to you is just as vital to successful analysis.  This kind of verification and validation should also be repeated if a source continues to be used in OSINT collection.  Just because a data source is not appropriate for the kind of analysis you wish to do for one project does not mean it is invalid for other projects.  All OSINT sources have some merit, even if only as an example of what you DO NOT want.
