OUR MISSION OF
RADICAL TRANSPARENCY
Where we get our data
Data. It’s everywhere. But which data can you trust, and which will lead you down the dark path toward despair and demise? We use scraping to extract data from “trusted” sites on the web and compare it against large reference datasets to find inconsistencies that help us deduce qualities like trustworthiness and confidence.
EXTRACTING DATA FROM THE WEB IS DECEPTIVELY SIMPLE.
It’s estimated there are currently some 6 billion indexable pages and 1.2 million terabytes of accessible information available at any given time, and those numbers keep growing every day.
The internet is enormous, and it uses a wide range of technologies to deliver even the simplest web page to your browser. I won’t pretend to explain how all of those technologies work. I do, however, intend to give you the big picture of how we harvest data, and how it’s converted into meaningful information.
You will be happy to know we do not have an army of workers manually extracting data using copy/paste. Such an operation would be time-consuming and, most importantly, error-prone: humans make too many mistakes. As individuals, we make preventable errors every day. It’s just human nature.
Because we scrape millions of pages for the most accurate and up-to-date information, we rely on automation. The reality is that without it, the project wouldn’t be viable.
WHAT IS SCRAPING?
In a nutshell, scraping is the harvesting of data from a web page or similar resource. It is sometimes referred to as ‘web scraping’, ‘web harvesting’ or ‘web data extraction’.
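If you’ve never seen it in practice, here’s a minimal sketch of the idea in Python. The URL and selector are placeholders rather than one of our actual targets, and our production pipeline is considerably more involved:

```python
# Minimal web-scraping sketch: fetch a page and pull structured data
# out of its HTML. The URL and selector below are placeholders.
import requests
from bs4 import BeautifulSoup

def scrape_headings(url: str) -> list[str]:
    """Fetch a page and return the text of every <h2> heading."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.select("h2")]

if __name__ == "__main__":
    for heading in scrape_headings("https://example.com"):
        print(heading)
```

That’s the whole trick: request the page, parse the markup, keep the parts you care about. Everything else is scale and care.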
HOW DO WE USE IT?
PRICE MONITORING
We look for price changes, sales, new, refurbished and discontinued products, and more (there’s a code sketch of this after these examples).
NEWS AGGREGATION
We run sentiment analysis on aggregated news as an alternative data source for the products, services, etc. featured on the site.
SOCIAL MEDIA
We look for social signals and influencer activity, which includes tracking follower growth and other engagement mechanics.
REVIEW AGGREGATION
We extract reviews from a range of websites related to products, services and more.
SEARCH ENGINE RESULTS
We monitor search engine results page (SERP) activity, including videos, images and marketplaces.
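To make one of these concrete, here’s a rough sketch of price monitoring: compare a freshly scraped price against the last value stored and flag any movement. The product ID and the in-memory store are illustrative stand-ins, not our real system:

```python
# Price-monitoring sketch: diff a newly scraped price against the last
# one seen and report sales or hikes. A dict stands in for a database.
from decimal import Decimal

price_history: dict[str, Decimal] = {}  # product_id -> last seen price

def check_price(product_id: str, scraped_price: Decimal) -> str | None:
    """Return an alert message if the price moved, else None."""
    previous = price_history.get(product_id)
    price_history[product_id] = scraped_price
    if previous is None or previous == scraped_price:
        return None
    direction = "dropped" if scraped_price < previous else "rose"
    return f"{product_id}: price {direction} from {previous} to {scraped_price}"

check_price("tv-4k-55", Decimal("549.99"))         # first sighting, no alert
print(check_price("tv-4k-55", Decimal("499.99")))  # prints a "dropped" alert
```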
HOW DO WE DO IT?
To build and maintain our model, we created a multifaceted process we call TAVNO (Target, Acquisition, Validation, Normalisation & Output). Every element plays a key part.
In principle, the process runs sequentially. However, each segment of the process has its own set of parameters and arguments, all of which must complete before its cycle can finish. I won’t go into too much detail on the individual segments, but here is a breakdown of the model:
TARGET: Websites are selected against a list of predefined criteria; we then analyse the structure of those sites and identify the valuable information and metrics contained within them. Each site is typically unique, so the process can be wide-ranging.
ACQUISITION: The second stage involves using the information we have identified in the previous stage to define data points. We then scrape the resource for the most insightful and important information.
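As a simplified illustration (not our actual configuration), you can think of data points as named selectors that get extracted in a single pass over a page:

```python
# Acquisition sketch: data points defined as named CSS selectors and
# extracted together. Field names and selectors here are made up.
from bs4 import BeautifulSoup

DATA_POINTS = {
    "name": "h1.product-title",
    "price": "span.price",
    "rating": "div.stars",
}

def acquire(html: str) -> dict[str, str | None]:
    """Extract every defined data point from raw page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, selector in DATA_POINTS.items():
        node = soup.select_one(selector)
        record[field] = node.get_text(strip=True) if node else None
    return record
```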
VALIDATION: Accuracy of information is of the utmost importance. We adhere strictly to the computer science principle of GIGO (Garbage In Garbage Out). So we validate all information we have collected and use several algorithms for data integrity, authenticity and authentication.
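Here’s a toy example of what that gatekeeping can look like; the rules shown are illustrative, not our actual integrity checks:

```python
# Validation sketch in the spirit of GIGO: reject records that fail
# basic integrity checks before they enter the pipeline.
from decimal import Decimal, InvalidOperation

def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if not record.get("name"):
        problems.append("missing product name")
    try:
        price = Decimal(str(record.get("price", "")).lstrip("$"))
        if price <= 0:
            problems.append("non-positive price")
    except InvalidOperation:
        problems.append("unparseable price")
    return problems
```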
NORMALISATION: Once we have acquired and validated the information, we use a variety of techniques to normalise the data and create meaningful relationships between the various metrics.
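As one simple illustration of such a technique, ratings reported on different scales can be min-max scaled onto a common range so they become comparable (the scales below are examples):

```python
# Normalisation sketch: min-max scale ratings from different sites
# onto a common 0-1 range so they can be meaningfully compared.
def normalise(value: float, low: float, high: float) -> float:
    """Scale a value from [low, high] onto [0, 1]."""
    return (value - low) / (high - low)

site_a = normalise(4.2, low=0, high=5)     # 5-star scale   -> 0.84
site_b = normalise(83.0, low=0, high=100)  # percent scale  -> 0.83
```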
OUTPUT: To turn the data into meaningful information, we use a set of predefined associative data structures to map it onto our site. This enables us to provide actionable insights you can utilise for a wide range of decisions.
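A bare-bones sketch of that mapping, with illustrative keys rather than our actual structures:

```python
# Output sketch: group cleaned records into an associative structure
# keyed the way the site consumes it. The "category" key is made up.
from collections import defaultdict

def to_site_index(records: list[dict]) -> dict[str, list[dict]]:
    """Group normalised records by category for display on the site."""
    index: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        index[record["category"]].append(record)
    return dict(index)
```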
PUBLIC DATA VS PRIVATE DATA
We’ve been throwing the word “data” around a lot. Everyone has, and it hasn’t always been positive. So let’s get ahead of some questions and make things clear.
HERE’S WHAT WE’RE NOT DOING
Performance criteria are the finer qualities that live beneath a larger “umbrella”. Things like brightness, contrast ratio and color accuracy on a television could be considered performance criteria. Most of these criteria are quantitative, meaning you won’t see some wishy-washy explanation: brightness is measurable, and that’s how it’s reported and collected.
HERE’S WHAT WE ARE DOING
Categories of performance are the “umbrella” that covers the performance criteria we just talked about. The Experts should know (just as we know) which categories are applicable to, and responsible for, a quality product. From TVs to toasters, there’s always a set of qualities and metrics to be on the lookout for. We’re on the lookout for them. Google’s on the lookout for them.
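As a rough illustration of how quantitative criteria sit under a category “umbrella” (the names and numbers are examples, not our taxonomy):

```python
# Illustration only: an "umbrella" category holding measurable
# performance criteria. Values are invented for the example.
PERFORMANCE_MODEL = {
    "picture_quality": {              # category (the "umbrella")
        "brightness_nits": 850,       # measurable, not wishy-washy
        "contrast_ratio": 5000,
        "color_accuracy_delta_e": 2.1,
    },
}
```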
Public data is the name of the game. If it’s public, it’s legal to scrape. That’s the position taken by the US Court of Appeals for the Ninth Circuit in hiQ Labs, Inc. v. LinkedIn Corp., which held that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act.
If it’s not public, it’s not legal to scrape. That’s it. We have no use for personally identifiable information. We’re helping you make decisions about products by giving you facts about products, testers, testing criteria and markets. We’re not trying to appeal to whether or not you like the color red.