Computer-assisted reporting (CAR) remains one of
the biggest advances in the past 20 years for investigative reporting. This
section has information about how CAR can assist you in your work.
CAR covers two main areas: data mining and online research. We
have also included sections on anonymity, as protecting your identity and data
when online becomes more important.
data mining
Excel 1
Access
SQL – to come
online investigating
finding people
advanced search
finding hidden documents
finding website owners
automated web browsing
anonymity
published stories
All of the following stories came from analysing data produced
by Freedom of Information Act enquiries.
Elena Egawhary (front page) in The Guardian, July 2007.
Heather Brooke in The Times, December 2007.
The Fire Brigades Union in Metro, February 2008.
online investigating
finding people
Whether you are looking for whistleblowers or experts in esoteric
fields, a number of methods can both improve accuracy and save
time for investigators.
This section has information about the ‘hidden web’: those
subscription, free and non-indexed sources (including directories and archives)
that will help you to find contributors online.
Bespoke people finders
These are the most useful tools if you have the name of the
person you are looking for but want more information, such as contact details or
articles they’ve written.
192.com www.192.com
(subscription)
Or try one of the free (albeit less robust) alternatives:
123 people
www.123people.com
Pipl www.pipl.com
Yasni www.yasni.com
Yoname www.yoname.com
Search engine functions (in Google)
Below are some of the advanced functions available to the
searcher. For more detailed ways of searching, including by document type, see
newsgathering.
The domain function: site:
When using the domain function there are three elements involved. If you were
looking for an academic expert you would use the following:
- subject term/s
- the term connecting the subject to his/her profession (e.g. expert,
department, professor etc.)
- the domain function, site:, followed by the domain to search within (e.g.
site:ac.uk for UK universities)
Compare:
“solvent abuse” professor site:ac.uk
With:
expert solvent abuse
And compare:
fraud ~data professor site:.ac.uk
With:
expert data fraud
Specifying the subject, level of expertise and limiting the
search to academic urls can make finding experts much quicker.
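The three elements above can be assembled mechanically. The following is a minimal Python sketch, not part of the original notes: the function names expert_query and google_url are made up for illustration, and the search URL pattern (a q parameter on www.google.com/search) is an assumption about how the query would be submitted.

```python
from urllib.parse import urlencode


def expert_query(subject: str, role: str, domain: str) -> str:
    """Combine a quoted subject phrase, a profession term and a site: filter."""
    return f'"{subject}" {role} site:{domain}'


def google_url(query: str) -> str:
    """Turn a query string into a search URL (assumed URL pattern)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = expert_query("solvent abuse", "professor", "ac.uk")
print(query)              # "solvent abuse" professor site:ac.uk
print(google_url(query))  # the same query, URL-encoded
```

Changing only the domain argument (for example to org.uk) reuses the same subject and role terms against a different group of sites.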
A full list of top-level domains can be found at
NORID domains.
Google’s cached option will show you the page as it was when
Google last indexed it, so you won’t miss out on your terms if the page has
changed since.
You can also use the domain function to find local pressure
groups, nimby groups, associations and non-commercial bodies, for example when
looking for pressure groups opposed to the building of phone masts.
Compare:
group “telephone masts” opposed
With:
group “telephone masts” opposed site:.org.uk
You can also use the domain function to find discussions (and
hence contributors) in Facebook, and other social networks.
See:
“I worked” “lehman brothers” site:facebook.com/topic
You can only do this through Google, not within the Facebook
search. See the
Slewfootsnoop blog for more information.
Finding contributors via social networks
Facebook is one of the most popular
social networking sites. Many of its users are interested in international
social and political issues, and some are experts in their field – the site
contains groups based on themes and issues from around the world. Try searching
ecology society site:facebook.com
Likewise, Myspace has
similar groups – try searching their groups for
alternative energy.
Other social networks are popular in different parts of the
world. For example, if you have a Google account and are interested in finding
contributors from South America – give Orkut
a try. It’s very popular in Brazil and India.
Likewise, Badoo is more
popular in the rest of Europe than in the UK, and they are even making an effort
to progress in the Russian Social Network market.
But perhaps the best place to start is amongst those services
which allow online communities to create their own social networks.
Ning is a good example of this.
If you are looking for professional communities then
LinkedIn is probably a good place to
start. See the Slewfootsnoop blog for a
comparison between LinkedIn and Facebook for finding people.
It may be possible to find contributors and potentially useful
actuality from photo-sharing network
Flickr
– try this
tag-search for local pollution.
Technorati
is currently the best known search engine for blogs. An alternative is
Google’s Advanced Blog Search.
You can also use the
advanced search on
Twitter.
Contributor finding in pre-web 2.0 sources
Try using Google scholar to keep
up-to-date on the latest academic findings and experts in your field.
Amazon advanced search is also a great
place to find experts around a subject matter.
And don’t discount the various forums, and boards people use
to express themselves, and flag up issues worth investigating – you can even
create your own search engine to track people who contribute to different
online forums.
Contributor Finding Online – Murray Dick – July 2009
useful links
ProfNet
A database of communications professionals and PR people.
advanced search
This section outlines how you can make your
searching more accurate. It is taken from notes and lectures by Murray Dick.
narrowing searches
You can tighten your search results using the following:
AND: (implicit)
OR: blair wmd OR weapons
NOT: rangers -qpr
phrase search: which is the “richest bank in the UK” (try with and without
quotes).
Wildcard: Google doesn’t support the wildcard in the way it is
conventionally used in other search engines – MSN, Yahoo or Exalead – it uses
automatic stemming.
Nevertheless, you can use a * in phrase-searching. Google
treats the * as a placeholder for a word or more than one word, where you want
to do an expansive search. For example, “corruption in the * industry” expert
can help you find experts in corruption in different fields.
The plus sign (+) allows you to stop Google from stemming your
words, which helps if you are interested in a word in one particular form. It
can also be applied to stop Google finding references to certain words that
link to (rather than feature in) the pages you are searching from, when viewing
cached content. Lastly, it can be applied to media sources, allowing you to
search stories about a specific company in Google News.
Synonyms (~) for example: ~marriage will find references to
love, marriage, romance etc.
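The operators above can be combined in a single query string. The following Python sketch is only an illustration of how the pieces fit together: compose_query and its parameter names are invented here, and the operator behaviour is as described in these notes rather than something the code checks.

```python
def compose_query(phrase=None, all_of=(), any_of=(), none_of=(), synonyms=()):
    """Build a query string from the operators described in this section."""
    parts = []
    if phrase:
        parts.append(f'"{phrase}"')                 # phrase search
    parts.extend(all_of)                            # AND is implicit
    if any_of:
        parts.append(" OR ".join(any_of))           # OR between alternatives
    parts.extend(f"-{term}" for term in none_of)    # NOT
    parts.extend(f"~{term}" for term in synonyms)   # synonyms
    return " ".join(parts)


print(compose_query(all_of=["blair"], any_of=["wmd", "weapons"]))
# blair wmd OR weapons
print(compose_query(all_of=["rangers"], none_of=["qpr"]))
# rangers -qpr
print(compose_query(phrase="corruption in the * industry", all_of=["expert"]))
# "corruption in the * industry" expert
```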
It’s worth bearing in mind that other engines offer an even
broader range of search operators. Exalead,
for example, permits atleast and proximity searching. Their atleast function
allows the searcher to find pages that feature a term prominently, which can be
useful when you are searching for backgrounders on people or issues.
The proximity search function allows the searcher to find
terms which occur close to each other, which can be useful when trying to
unearth connections between people and events in the news.
This
API search allows proximity searching in Google results, albeit only where
the terms you wish to find are no more than three words apart. There’s more
information about how API searching can help in journalistic research on the
Slewfootsnoop blog.
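To make the idea of proximity concrete, here is a small Python sketch that applies the same test to text you have already collected. It is not how any search engine implements the feature: the function name within_proximity is invented, the sample sentence is made up, and "three words apart" is interpreted here as a difference of at most three word positions.

```python
import re


def within_proximity(text: str, term_a: str, term_b: str, max_gap: int = 3) -> bool:
    """Return True if the two terms occur within max_gap word positions."""
    words = re.findall(r"\w+", text.lower())
    positions_a = [i for i, word in enumerate(words) if word == term_a.lower()]
    positions_b = [i for i, word in enumerate(words) if word == term_b.lower()]
    return any(abs(i - j) <= max_gap for i in positions_a for j in positions_b)


# Made-up sample sentence, for illustration only.
sample = "The minister met the contractor before the vote"
print(within_proximity(sample, "minister", "contractor"))  # True
print(within_proximity(sample, "minister", "vote"))        # False
```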
the occurrences function: intitle:
This is used for finding reliable backgrounders. Bear in mind, however, that
standards in metadata vary widely. Think about what is included in
professional sites’ web page titles. For example, if you want to find background
information (analysis, not news; professional, not amateur) about Somalia’s
troubled political history:
Compare:
somalia crisis background
With:
somalia crisis intitle:Q&A
Instead of background you could try: depth, comment, analysis
or brief.
You can search for these terms in the url using the following
search:
inurl:somalia analysis
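The alternative title terms suggested above can be tried systematically. This Python sketch simply generates one intitle: query per term so that each can be pasted into the search box in turn. The function name intitle_variants is invented for illustration.

```python
# Title terms suggested in this section for finding backgrounders.
BACKGROUND_TERMS = ["Q&A", "depth", "comment", "analysis", "brief"]


def intitle_variants(subject: str, terms=BACKGROUND_TERMS) -> list:
    """Return one query per title term, each restricted with intitle:."""
    return [f"{subject} intitle:{term}" for term in terms]


for query in intitle_variants("somalia crisis"):
    print(query)
```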
searching through documents
By specifying the type of file you want to search within you can tighten your
search even further. Financial information is more likely to be held in an
Excel spreadsheet than a web page, so limiting the search to this type of
file produces more accurate results:
Compare:
house prices Greenwich
With:
“house prices” greenwich 2007 filetype:xls
Also try switching format to Powerpoint (filetype:ppt) for finding experts, as
they are likely to have demonstrated their expertise in presentations.
languages
You can also make use of the language selector in advanced search to filter articles by language.
Compare:
scudetto “silvio berlusconi” (with and without filter switched to English).
links
This is how you find out who is linking to a site, which can highlight bias or
partisanship. In Google advanced search go to ‘date, usage rights, numeric
range’ and copy the url of the site you are checking where it says ‘find pages
that link to the page’. Other useful tools for doing this are
Back Link Watch and
iwebtool.
You can find out more about searching for
hidden documents elsewhere on this site.
useful links
Search Engine Watch
Provides data and ratings on the different search engines.
Startpage
Claims to be the world’s most private search engine as it does not record your
IP address.
A9
Searches e-commerce websites
Internet Archive
Also known as the Wayback Machine, this is a digital library of web sites as
they used to be.
Research clinic
Features links, tools and study material for professional researchers. The site
accompanies courses delivered by the BBC’s Internet research specialist, Paul
Myers.
automated web browsing
This handout is a supplement to the full
presentation given by Mike Schrenk at the CIJ Summer School 2009.
The full presentation is available at:
http://www.schrenk.com/cij
Online research often requires repetitive downloading of web
pages. That process, along with extracting information found on websites, is
tedious and error prone. Screen scraping and iMacros allow journalists to
automate the process of computer-aided research.
screen scraper
A screen scraper is software that conducts automated
browsing activities on the internet. A primary purpose of a screen scraper is to
extract information from websites.
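As a rough illustration of the extraction step, the Python sketch below pulls every link and its text out of a page using only the standard library. It is not taken from the presentation: the class name LinkExtractor and the sample HTML are invented, and a real scraper would first download the page (for example with urllib.request) rather than use a built-in string.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from the <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up sample page standing in for a downloaded web page.
SAMPLE_PAGE = """
<html><body>
<a href="/reports/2008.xls">Annual figures 2008</a>
<p>No link here.</p>
<a href="/contact">Contact the <b>press office</b></a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(SAMPLE_PAGE)
for href, text in parser.links:
    print(href, "->", text)
```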
iMacros
iMacros is a browser plug-in that lets you write ‘macros’,
which are ‘pseudo’ programming tools that allow the automation of standard
programs (like browsers).
iMacros is available for Internet Explorer and Firefox. I have
had better results with Firefox and highly recommend its use over Internet
Explorer.
Location for iMacros download (for Firefox)
https://addons.mozilla.org/en-US/firefox/addon/3863
initiating iMacros
The iMacros button in Firefox is located in the browser tool
bar next to the url.
resources
iMacros download page (for Firefox)
https://addons.mozilla.org/en-US/firefox/addon/3863
iMacros home page
http://www.iopus.com/
iMacros command reference
http://wiki.imacros.net/Command_Reference
iMacros user forums
http://forum.iopus.com/
Demo website
http://www.schrenk.com/cij/imacros_demo.php
command reference
The following is a list of all available iMacros commands.
Each command has zero or more parameters. If parameters can be omitted,
they are enclosed by square brackets. If several
choices are possible for the same parameter, they appear in round brackets and
are separated by the | character. Integer numbers are denoted by the letters n
or m; all other names denote a series of
characters (strings).
The ' character indicates a comment. If a line starts with ',
everything after the ' is ignored. Typically this is used for comments or to
disable specific parts of a macro.
Note: a macro cannot have empty lines, as an empty line indicates the end of the
macro. So every line in the macro must have at least the comment symbol.
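Because a blank line ends the macro, it can be worth cleaning a draft before running it. The Python sketch below replaces every empty line with a bare comment symbol. It assumes the comment symbol is the straight apostrophe ('). The function name fill_blank_lines is invented, and the draft macro uses only the URL GOTO and WAIT SECONDS commands from the reference together with the demo website address listed under resources.

```python
COMMENT = "'"  # assumed iMacros comment symbol (straight apostrophe)


def fill_blank_lines(macro_text: str) -> str:
    """Replace empty lines with a comment so the macro does not end early."""
    lines = macro_text.splitlines()
    return "\n".join(line if line.strip() else COMMENT for line in lines)


draft = (
    "' Open the demo page\n"
    "URL GOTO=http://www.schrenk.com/cij/imacros_demo.php\n"
    "\n"
    "' Pause before the next step\n"
    "WAIT SECONDS=2"
)

cleaned = fill_blank_lines(draft)
print(cleaned)
```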
ADD result_var added_value
Adds a value to a variable.
BACK
Opens the previously visited web page.
CLEAR
Clears browser cache and cookies on the hard drive.
CLICK X=n Y=m [CONTENT=some_content]
“Clicks” on the element at the specified X/Y coordinates.
CMDLINE variable default_value
Sets the variable to a value retrieved from the command line.
DISCONNECT
Disconnects the current dial-up connection.
EXTRACT POS=[R]n TYPE=(TXT|HREF|TITLE|ALT) ATTR=Anchor*
Extracts data from websites.
FILEDELETE NAME=file_name
Deletes a file.
FILTER TYPE=IMAGES STATUS=(ON|OFF)
Filters web site elements. Currently the support for filtering is experimental.
If you need any other data filtered, please let us know what kind of filter you
would like to see added.
FRAME F=n
Directs all following TAG or EXTRACT commands to the specified frame.
IMAGECLICK IMAGE=image_file CONFIDENCE=n [CONTENT=some_content]
Sends a WINCLICK command to the specified image.
IMAGESEARCH IMAGE=image_file CONFIDENCE=n
Searches for the input image specified via the IMAGE attribute.
ONCERTIFICATEDIALOG C=n
Selects the client side certificate from a dialog.
ONDIALOG POS=n BUTTON=(YES|NO|CANCEL) [CONTENT=some_content]
Handles JavaScript dialogs.
ONDOWNLOAD FOLDER=folder_name FILE=file_name
Handles download dialogs.
ONERRORDIALOG BUTTON=(YES|NO) CONTINUE=(YES|NO)
Handles error dialogs.
ONLOGIN USER=username PASSWORD=password
Handles login dialogs.
ONSECURITYDIALOG BUTTON=(YES|NO) CONTINUE=(YES|NO)
Handles security dialogs.
ONWEBPAGEDIALOG KEYS=some_keys
Handles web page dialogs.
PRINT
Prints the current browser window.
PROMPT prompt_text variable_name [default_value]
Displays a popup to ask for a value. This value is stored in the variable.
PROXY ADDRESS=proxy_URL:port [BYPASS=page_name]
Connects to a proxy server to run the current macro.
REDIAL ISP
Redials a connection.
REFRESH
Refreshes (reloads) the current browser window.
SAVEAS TYPE=(CPL|MHT|HTM|TXT|EXTRACT|BMP) FOLDER=folder_name FILE=file_name
Saves information to a file.
SET variable_name variable_value
Assigns values to built-in variables.
SIZE X=n Y=m
Resizes the iMacros Browser Window.
STOPWATCH ID=id
TAB T=(n|OPEN|CLOSE|CLOSEALLOTHERS)
Sets focus on the tab with number n.
TAG POS=n TYPE=type [FORM=form] ATTR=attr [CONTENT=some_content]
Selects a webpage element.
URL GOTO=some_URL
Navigates to a URL in the currently active tab.
VERSION BUILD=4213805
Specifies the version of iMacros that created this macro.
WAIT SECONDS=(n|#DOWNLOADCOMPLETE#)
Waits for a specific time.
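When the same steps have to be repeated for many pages, the macro text itself can be generated. The Python sketch below writes a macro that visits each address, waits, and saves the page, using only the URL GOTO, WAIT SECONDS and SAVEAS syntax listed above. Everything else is an assumption made for illustration: the function name build_download_macro, the folder name downloads, the page_1, page_2 file naming, and the straight apostrophe as the comment symbol. It does not handle any quoting iMacros may require for values containing spaces.

```python
def build_download_macro(urls, folder, wait_seconds=3):
    """Return macro text that opens and saves each URL in turn."""
    # Assumed comment symbol: the straight apostrophe.
    lines = ["' Generated macro: visit each page and save a copy"]
    for number, url in enumerate(urls, start=1):
        lines.append(f"URL GOTO={url}")
        lines.append(f"WAIT SECONDS={wait_seconds}")
        # The page_N file names are a hypothetical naming scheme.
        lines.append(f"SAVEAS TYPE=HTM FOLDER={folder} FILE=page_{number}")
    # No blank lines are emitted, since an empty line would end the macro.
    return "\n".join(lines)


macro = build_download_macro(
    ["http://www.schrenk.com/cij/imacros_demo.php", "http://www.schrenk.com/cij"],
    folder="downloads",  # hypothetical folder name
)
print(macro)
```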
anonymity
The ability to remain anonymous can liberate journalists and
facilitate research that would otherwise be impossible. The internet provides
unique opportunities to conduct serious research while protecting your identity.
Anonymity becomes more important as regimes place added
restrictions on journalists’ ability to speak freely. Regardless of the measures
governments take, however, journalists are still able to publish stories through
the use of “proxies”.
Why anonymity is useful to journalists
Hiding your identity while doing research
Anonymous browsing techniques may protect your identity and thereby provide
greater access while conducting research.
Allowing you to perform repetitive research
Anonymity may protect you when using automated or repetitive research tools.
Pretending that you are somewhere else
With certain techniques, you can conduct research while appearing to be doing so
from another country.
Protecting your sources
Your sources may use anonymity techniques to either protect their identity or to
make their story possible.
Defeating national digital defenses
Anonymity techniques can defeat national firewalls and get information out to
the rest of the world.
Rights to anonymity
Nations have varying views of anonymity and anonymous use of the internet. In
the United States, the Supreme Court has ruled repeatedly that the right to
anonymous free speech is protected by the First Amendment. A much-cited 1995
Supreme Court ruling in
McIntyre v Ohio Elections Commission reads:
“Protections for anonymous speech are vital to democratic
discourse. Allowing dissenters to shield their identities frees them to express
critical, minority views… Anonymity is a shield from the tyranny of the
majority… It thus exemplifies the purpose behind the Bill of Rights, and of the
First Amendment in particular: to protect unpopular individuals from
retaliation… at the hand of an intolerant society.”
A number of nations tightly control access to websites and
other online resources.
An introduction to the internet
Your IP address may identify:
Your country (location)
Your organisation, through reverse DNS look-ups
www.lookupserver.com
Possibly you!
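A reverse DNS look-up asks which host name is registered for an IP address, and that name often reveals the organisation. The Python sketch below shows the mechanics with the standard library. The function names are invented for illustration, the address 192.0.2.1 is a reserved documentation address rather than a real host, and lookup_hostname needs a network connection, so it is defined here but not called.

```python
import ipaddress
import socket


def reverse_dns_name(ip: str) -> str:
    """Return the special DNS name that a reverse look-up queries."""
    return ipaddress.ip_address(ip).reverse_pointer


def lookup_hostname(ip: str):
    """Ask the resolver for the host name of an IP address (needs network)."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        # No reverse record, or no network connection available.
        return None


print(reverse_dns_name("192.0.2.1"))  # 1.2.0.192.in-addr.arpa
```

Running lookup_hostname against your own public IP address shows what a website operator could learn about where you are browsing from.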
Anonymous email
Anonymous email is possible via a product called
Nyms, which
allows the creation of disposable email addresses via the Nyms network.
Proxies
Proxies act as intermediaries and protect your identity. There are different
types of proxies from different sources.
Open proxies are servers that, either intentionally or because
of misconfiguration, allow people to connect through their network and assume
one of their network IP addresses. Open proxies are best avoided.
An example of a website that lists open proxies is
www.xroxy.com/proxylist.htm
There are also commercial proxies that do a better job, for
example Anonymizer.
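For scripted research, requests can be routed through a proxy in much the same way as in a browser. The Python sketch below builds a standard-library opener that would send its traffic via a proxy. The function name opener_via_proxy is invented, and the address http://127.0.0.1:8080 is a placeholder, not a real service. No request is actually made here.

```python
import urllib.request


def opener_via_proxy(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener whose http and https requests go through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Placeholder proxy address, for illustration only.
opener = opener_via_proxy("http://127.0.0.1:8080")

# With a working proxy at that address, a page would be fetched like this:
# page = opener.open("http://www.example.org/").read()
```

The website being visited then sees the proxy's IP address rather than yours, although the proxy operator can still see what you request.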
Tor
Another proxy alternative is the Tor project.
Tor is the proxy network that
facilitates journalism from some of the most hostile environments in the world.
It is free software and an open network that helps protect against traffic
analysis, a form of network surveillance that threatens personal freedom and
privacy, confidential business activities and relationships, and state
security.
Tor was originally developed for the US Navy for the primary
purpose of protecting government communications. Today, it is used every day for
a wide variety of purposes by the military, journalists, law enforcement
officers, activists, and many others.
It protects you by bouncing your communications around a
distributed network of relays run by volunteers all around the world: it
prevents somebody watching your internet connection from learning what sites you
visit, and it prevents the sites you visit from learning your physical location.
Tor works with many of your existing applications, including web browsers,
instant messaging clients, remote login, and other applications based on the TCP
protocol.
Installing Tor
Tor for Firefox: optimising Tor in Firefox
Installing a proxy in your browser
From the Tor site:
How Tor works
Getting up to speed on Tor’s past, present, and future
Download Tor
This page is based on Mike Schrenk’s talk at the CIJ Summer
School – July 2009.
useful links
Pretty Good Privacy
Computer program that encrypts files and documents on hard drives. Can be used
for emails.
Scramdisk
Computer program that encrypts hard drives.
How to Find Media Email Addresses
Is exactly what it says.