TIC - THE INSURANCE COMPANY
This competition has a twofold purpose. The first is to gain a better understanding of the commonalities and differences between inductive methods by applying a range of them to a single analysis problem and comparing the solutions. The second is to increase public awareness of available technology. The number of practical applications of Machine Learning is growing quickly, and most applications are in data mining. However, the technology is still not well known, and a competition is one way to increase public awareness.
The competition is associated with the Benelearn workshop 1999, to be held in Maastricht on 5 November 1999. At the workshop the winner of the competition will be announced, and a session will be dedicated to discussing the solutions. Authors of successful solutions will be invited to present their solutions and to contribute to a joint paper on the competition problem (to be written after the workshop).
The competition will focus on benchmarking competing techniques such as machine learning, neural networks, evolutionary computing and statistics, and on demonstrating the use of these disciplines in addressing real-world issues. It should provide comparisons of the performance and relevance of the different approaches and give participants the chance to present their own methods and combinations of techniques.
2. Problem statement
The data used in this competition represent a frequently occurring problem: the analysis of data about the customers of a company, in this case an insurance company. The information about each customer consists of 86 variables and includes product usage data and socio-demographic data derived from zip codes. The data were supplied by the Dutch data mining company Sentient Machine Research and are based on real-world business data.
The competition consists of two tasks: a prediction task and a description task.
Judgment of the results
The prediction and description tasks are evaluated differently.
The prediction task involves predicting whether a customer has a caravan insurance policy from other data about the customer. The training set contains 5823 descriptions of customers, including whether they have a caravan insurance policy. The test set contains 4000 customers for whom only the organisers know whether they have a caravan insurance policy.
The prediction task is motivated by the decision to include customers in a mailing. Mail will be sent only to customers with a high probability of becoming caravan policy holders. The underlying problem is therefore to find the subset of customers whose probability of having a caravan insurance policy lies above some boundary probability. (The known policy holders can then be removed and the rest receive a mailing.) The boundary depends on costs and benefits, such as the cost of mailing and the benefit of selling insurance policies. To approximate this problem we do not ask participants to predict "caravan policy holding" but to find the set of 800 customers in the test set with the highest probability of having a caravan insurance policy. For each set of 800 the number of actual policy holders will be counted, and this count is the score of a solution.
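The selection and scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `select_top_k` and `score_selection` are hypothetical, the probabilities are assumed to come from whatever model a participant has built, and the targets are assumed to be 0/1 flags in the same customer order. The toy values are invented for illustration and are not competition data.

```python
from typing import List, Sequence


def select_top_k(probabilities: Sequence[float], k: int = 800) -> List[int]:
    """Return the indices of the k customers with the highest predicted probability."""
    if k > len(probabilities):
        raise ValueError("k exceeds the number of customers")
    # Sort indices by descending probability; sorted() is stable, so customers
    # with equal probability keep their original order.
    ranked = sorted(range(len(probabilities)), key=lambda i: -probabilities[i])
    return ranked[:k]


def score_selection(selected: Sequence[int], targets: Sequence[int]) -> int:
    """Count the actual policy holders (target == 1) among the selected customers."""
    return sum(targets[i] for i in selected)


# Toy example with 6 customers and k = 2 (the competition uses 4000 and 800).
toy_probabilities = [0.10, 0.80, 0.05, 0.60, 0.30, 0.02]
toy_targets = [0, 1, 0, 0, 1, 0]
toy_selected = select_top_k(toy_probabilities, k=2)
toy_score = score_selection(toy_selected, toy_targets)
print(toy_selected, toy_score)
```

Because only the membership of the selected set matters for the score, any model that ranks customers sensibly can be used, even if its probability estimates are poorly calibrated.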
When two solutions have equal scores, the winner will be the solution that was submitted first. For details of the submission format see below.
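The tie-breaking rule amounts to ordering submissions by score and then by submission time. A minimal sketch, assuming each submission carries a score and a timestamp, is shown here. The `Submission` class, the team names, the scores and the dates are all hypothetical and exist only to illustrate the ordering.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Submission:
    team: str               # hypothetical identifier for the participant
    score: int              # actual policy holders among the 800 selected customers
    submitted_at: datetime  # time at which the solution was received


def rank_submissions(submissions: List[Submission]) -> List[Submission]:
    """Order by higher score first; on equal scores, earlier submission first."""
    return sorted(submissions, key=lambda s: (-s.score, s.submitted_at))


# Invented toy values, not real competition results.
toy_submissions = [
    Submission("A", 110, datetime(1999, 10, 20)),
    Submission("B", 115, datetime(1999, 10, 25)),
    Submission("C", 115, datetime(1999, 10, 22)),
]
toy_ranking = [s.team for s in rank_submissions(toy_submissions)]
print(toy_ranking)
```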
The purpose of the description task is to give a "marketeer" a clear insight into why customers have a caravan insurance policy and how these customers differ from other customers. Descriptions can take the form of a regression equation, decision tree, neural network, linguistic description, graphical representation or any other form. The value of a description is inherently subjective. Submitted descriptions will be evaluated by the jury and an expert in insurance marketing on their comprehensibility and usefulness, and on the degree to which they were constructed automatically.
Solutions for the prediction and description tasks will be evaluated separately. Although semi-automated construction of solutions is permitted, automatically constructed solutions will be rated higher.
3. Procedure and Dates
The competition will be supervised and submissions will be evaluated by Peter van der Putten (Sentient Machine Research, Amsterdam), Floor Verdenius (ATO-DLO, Wageningen), Arno Knobbe (Syllogic, Amersfoort), Guszti Eiben (Universiteit Leiden) and Maarten van Someren (Universiteit van Amsterdam).
The detailed problem description and the data for the analysis of the problems can be downloaded as:
datadescription.html Description of data
ticdata.zip (227 KB) Classified data (for training)
ticeval.zip (104 KB) Data for prediction (without caravan policy data)
Solutions and questions should be sent to: Maarten van Someren (firstname.lastname@example.org) or Peter van der Putten, Sentient Machine Research, (email@example.com)
The report of a solution must consist of:
Time schedule for the Benelearn competition 1999:
Launch of competition
Deadline for submissions
Benelearn workshop / announcement of winner
Answers to the problem: the file targets.txt contains the actual caravan policy owners