Saturday 1 August 2009

The perils of thoughtless analysis

In April the US Director of National Intelligence's Open Source Center - a part of the American intelligence community that produces reports based on open source material - used data from the Worldwide Incidents Tracking System (WITS) to analyse violent incidents in Afghanistan.
The report was for official use, but a copy has ended up on the website of the Federation of American Scientists where it can easily be accessed.
It's called Afghanistan - Geospatial Analysis Reveals Patterns in Terrorist Incidents 2004-2008 and uses some novel techniques in an effort to make violence in Afghanistan more understandable, from a military point of view.
When I checked, the WITS database had details of more than 4,000 incidents in Afghanistan over this period. Not all record geographical details. However, the data is strong and it allows all kinds of analysis, including the following:
mapping incident density, identifying the dominant ethnic group where incidents occurred, mapping incidents by district or province, identifying seasonal changes in patterns of attack, distributions of deaths and kidnappings, comparisons with events in neighbouring Pakistan, etc.
The results are mixed. We find out, for example, that it is possible to predict, with reasonable accuracy, where attacks will occur, based on previous attacks. This is hardly rocket science, but potentially useful in a situation where troops are rotating every six months or so. Incidents per district can be worked out, as can who carried out the attacks.
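To make the "predict from previous attacks" idea concrete, here is a minimal sketch of the simplest possible baseline: assume the districts with the most past incidents will be the districts with the most future ones. The district labels and incident lists are invented for illustration and are not taken from WITS or the report, and the function names are hypothetical. The report's own method may well be more sophisticated.

```python
from collections import Counter

# Hypothetical incident logs: one district label per incident.
# Invented for illustration only, not WITS data.
past_incidents = ["A", "A", "B", "A", "C", "B", "A"]
later_incidents = ["A", "B", "A", "D"]


def predict_hotspots(history, k):
    """Return the k districts with the most past incidents."""
    return [district for district, _ in Counter(history).most_common(k)]


def hit_rate(predicted, actual):
    """Share of later incidents that fell inside a predicted district."""
    hits = sum(1 for district in actual if district in predicted)
    return hits / len(actual)


hotspots = predict_hotspots(past_incidents, k=2)
print(hotspots, hit_rate(hotspots, later_incidents))
```

With this made-up data the two busiest past districts capture three of the four later incidents. A baseline like this is only as good as the location data behind it.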
The report includes a table on attacks showing in one column 'perpetrator' and in the next column '%age of attacks'. Problem: the categorisation of who carried out an attack rests with a US platoon commander or Afghan policeman who may often know little about the complexity of Taliban politics.
So we find out that 64 per cent of all attacks are carried out by 'Taliban'. 'Unknown' accounts for 33 per cent. The remaining three per cent is divided between 'Taliban/al-Qaeda' (2%), 'Al-Qaeda' (1%), 'Taliban/Other' (0.3%), 'Hizb-i-Islami' (0.21%), 'Islamic Jihad Union' (0.17%), and even (I have no idea how) 'Taliban/Nigeria' (0.02%).
This categorisation bears no relationship to the reality of the Taliban on the ground. Most analysts accept that there are around seven factions. All this is lost in the analysis because the people deciding who carried out the attack are not able to make an accurate choice. In computer terminology this can be described as crap in = crap out.
The same problem dogs the analysis of incident type, where we are told that 42 per cent of attacks are IEDs. Next comes 36 per cent for 'Armed attacks'. 'Ambush' rates just 0.46% of all incidents - surely an underestimate? And what about attacks that start with an IED explosion and are followed up by armed attacks? They don't seem to exist.
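One plausible explanation for the missing combined attacks is that each incident is forced into a single category. The sketch below shows how that would hide them. It assumes, purely for illustration, that each record could list every tactic observed; the records are invented and are not WITS data, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical incident records, invented for illustration (not WITS data).
# Each incident lists every tactic observed, in order.
incidents = [
    ["IED"],
    ["IED", "Armed attack"],  # roadside bomb followed by small-arms fire
    ["Armed attack"],
    ["IED", "Armed attack"],
    ["Armed attack"],
]


def single_label_counts(records):
    """Tally by the first tactic only, as a one-column form would."""
    return Counter(tactics[0] for tactics in records)


def multi_label_counts(records):
    """Tally by the full combination of tactics."""
    return Counter(" + ".join(tactics) for tactics in records)


single = single_label_counts(incidents)
multi = multi_label_counts(incidents)
print(single)
print(multi)
```

Under the single-label tally the two complex attacks are simply counted as IEDs and the combination never appears. Keeping the full list of tactics preserves it as its own category.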
Some information is quite interesting. The analysis shows, for example, a tendency for attacks to move south and west in the winter and north and east in the summer. It also shows that hotspots follow the main national highway and fall predominantly in the Pashtun ethnic areas in the south. Most are also very close to the Afghanistan-Pakistan border.
The lesson here is that computers are wonderful at processing information in lots of interesting ways. But if you give them rubbish data, they will probably mislead you. With cleaner information, this kind of analysis could be very helpful indeed.
