theatlantic | The basic idea is simple.
Internet providers want to know as much as possible about your browsing
habits in order to sell a detailed profile of you to advertisers. If the
data the provider gathers from your home network is full of confusing,
random online activity, in addition to your actual web-browsing history,
it’s harder to make any inferences about you based on your data output.
Steven Smith, a senior staff member at MIT’s Lincoln Laboratory, cooked up a
data-pollution program for his own family last month, after the Senate
passed the privacy bill that would later become law. He uploaded the
code for the project, which is unaffiliated with his employer, to GitHub.
For a week and a half, his program has been pumping fake web traffic
out of his home network, in an effort to mask his family’s real web
activity.
Smith’s algorithm begins by stringing together
a few words from an open-source dictionary and googling them. It grabs
the resulting links in a random order, and saves them in a database for
later use. The program also follows the Google results, capturing the
links that appear on those pages, and then follows those links, and so
on. The table of URLs grows quickly, but it’s capped around 100,000, to
keep the computer’s memory from overloading.
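The collection step described above can be sketched roughly as follows. This is a minimal illustration, not Smith's code: the function names, the three-word query length, and the choice to simply stop adding links once the cap is hit are all assumptions, and the actual search request and page fetching are left out.

```python
import random

MAX_URLS = 100_000  # cap reported in the article


def random_query(words, rng, n_words=3):
    """Join a few randomly chosen dictionary words into a search query.

    n_words=3 is an assumption; the article only says "a few words".
    """
    return " ".join(rng.sample(words, n_words))


def add_links(pool, links, rng, max_urls=MAX_URLS):
    """Add newly found links to the pool in random order, up to the cap."""
    shuffled = list(links)
    rng.shuffle(shuffled)
    for url in shuffled:
        if len(pool) >= max_urls:
            break  # assumption: stop adding once the cap is reached
        pool.add(url)
    return pool


# Small demonstration with a made-up word list and placeholder links.
rng = random.Random(0)
words = ["harbor", "violin", "granite", "meadow", "lantern"]
query = random_query(words, rng)
found = [f"https://example.com/{i}" for i in range(10)]
pool = add_links(set(), found, rng, max_urls=4)
```

In a full crawler, the links found on each fetched page would be passed back through `add_links`, which is what makes the table grow so quickly.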
A program called PhantomJS, which mimics a person using a web browser,
regularly downloads data from the URLs that have been
captured—minus the images, to avoid downloading unsavory or infected
files. Smith set his program to download a page about every five
seconds. Over the course of a month, that’s enough data to max out the
50 gigabytes of data that Smith buys from his internet service provider.
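A quick back-of-the-envelope check shows what those figures imply. The sketch below assumes a 30-day month, decimal gigabytes, and a constant five-second pace; since the program slows down at night, the real page count would be somewhat lower.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_month(interval_s=5, days=30):
    """Pages fetched in a month at one page every interval_s seconds."""
    return days * SECONDS_PER_DAY // interval_s


def avg_page_bytes_to_hit_cap(cap_gb=50, interval_s=5, days=30):
    """Average page size (bytes) at which the monthly data cap is used up."""
    return cap_gb * 10**9 / pages_per_month(interval_s, days)


monthly_pages = pages_per_month()          # 518,400 pages
avg_bytes = avg_page_bytes_to_hit_cap()    # roughly 96 kB per page
```

In other words, about half a million page loads a month, averaging a little under 100 kilobytes each, would be enough to use up a 50-gigabyte allowance.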
Although it relies heavily on randomness, the program tries to emulate user
behavior in certain ways. Smith programmed it to visit no more than 100
domains a day, and to occasionally visit a URL twice—simulating a user
reload. The pace of browsing slows down at night, and speeds up again
during the day. And as PhantomJS roams around the internet, it changes
its camouflage by switching between different user agents, which are
identifiers that announce what type of browser a visitor is using. By
doing so, Smith hopes to create the illusion of multiple users browsing
on his network using different devices and software. “I’m basically
using common sense and intuition,” Smith said.
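The behavioral rules described above could look something like the following sketch. None of it is taken from Smith's project: the class and function names, the night-time hours, the slowdown factor, and the user-agent strings are illustrative assumptions.

```python
import random
from urllib.parse import urlparse

MAX_DOMAINS_PER_DAY = 100  # daily limit reported in the article

# Illustrative user-agent strings (assumptions, not Smith's actual list).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 "
    "(KHTML, like Gecko) Version/10.1 Safari/603.1.30",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/603.1.30 "
    "(KHTML, like Gecko) Version/10.0 Mobile/14E277 Safari/602.1",
]


def delay_seconds(hour, base=5.0, night_factor=4.0):
    """Wait longer between pages at night (assumed here to be 23:00-07:00)."""
    is_night = hour >= 23 or hour < 7
    return base * night_factor if is_night else base


def pick_user_agent(rng):
    """Choose a browser identity at random for the next request."""
    return rng.choice(USER_AGENTS)


class DailyDomainBudget:
    """Allow visits to at most `limit` distinct domains per day."""

    def __init__(self, limit=MAX_DOMAINS_PER_DAY):
        self.limit = limit
        self.seen = set()

    def allows(self, url):
        domain = urlparse(url).netloc
        if domain in self.seen:
            return True  # revisiting a known domain costs nothing
        if len(self.seen) >= self.limit:
            return False
        self.seen.add(domain)
        return True

    def reset(self):
        """Call once a day to start a fresh budget."""
        self.seen.clear()
```

A scheduler would call `allows` before each fetch, sleep for `delay_seconds(current_hour)` between pages, and send the header returned by `pick_user_agent` with each request.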