Black-box models of complex situations are tempting. With no need to specify parameters, and no limit on the choice of variables, you can throw in everything including the kitchen sink and see how the model performs. But in practice it is not that simple. Every variable you throw in increases the complexity of the model, and the computational workload, dramatically. Although there are algorithms for dimension reduction and for assessing how much an input contributes to model accuracy, nothing beats a good old human being with expertise in the subject domain for the initial choice of input variables.
Let's do an academic exercise: building a model that (1) hopefully gives an objective, 'true' valuation of a house, and (2) predicts future house prices. We will attempt to build our model using a neural network, or a hybrid of a neural network with genetic algorithms for the last mile of calculation, to avoid local optima. Any human expert in real estate will tell you that house values depend a lot on location, location and location. Let's keep this in mind and look at other possible input variables for the model, and there is no lack of them. Here is a possible list:
1. Rate of change in prices
2. Change in number of units sold
3. Number of unsold homes
4. Housing starts seasonally adjusted
5. Orders for new units
6. Interest rate level
7. Ratio of house price to median income
8. Ratio of rental income to house price
In the USA these are all common statistics, released at regular intervals by realtors' associations or the government, that try to assess the condition of the housing market.
Again, let's put aside variables 1-8 and solve the issue of location first.
Because location matters so much in housing, the model must first have a way to intelligently 'organize' the data into location-based categories. Characteristics such as postal code range may be one way to define the categories, but not necessarily the only way. Another is to categorize houses by location plus any other significant secondary characteristics that matter, e.g. prices, distance from schools or railway stations, even income. This sub-black-box within the main black box can be solved using a neural network of the Self-Organizing Map (SOM) variety. Also known as a Kohonen map (after Finnish Professor Teuvo Kohonen), it is excellent for the visualization of high-dimensional data. It works in an unsupervised way, classifying input vectors according to their Euclidean distance. In the first stage, the training stage, the map is drawn by running the SOM in an iterative process. After training, the mapping stage will 'map' any new input vector it has not seen before into the zone appropriate for it, according to its neighborhood characteristics. One way to speed up the training of the SOM is to apply principal components analysis to the inputs first.
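To make the two stages concrete, here is a minimal sketch of a Kohonen SOM in Python with numpy. The grid size, learning-rate decay, and Gaussian neighborhood schedule are illustrative assumptions, not values from any real housing model; the point is the iterative training stage followed by the mapping stage for unseen vectors.

```python
import numpy as np

def train_som(data, grid_h=4, grid_w=4, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Training stage: iteratively pull node weights toward the inputs,
    so nearby grid nodes come to represent similar input vectors."""
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    weights = rng.random((grid_h, grid_w, dim))
    yy, xx = np.mgrid[0:grid_h, 0:grid_w]          # grid coordinates
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)            # decaying learning rate
        sigma = sigma0 * (1 - epoch / epochs) + 0.5
        for x in data[rng.permutation(n)]:
            # best-matching unit: node whose weights are nearest in Euclidean distance
            d = np.linalg.norm(weights - x, axis=2)
            by, bx = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighborhood around the winning node on the grid
            g = np.exp(-((yy - by) ** 2 + (xx - bx) ** 2) / (2 * sigma ** 2))
            weights += lr * g[..., None] * (x - weights)
    return weights

def map_vector(weights, x):
    """Mapping stage: assign an unseen input vector to its nearest node."""
    d = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(d), d.shape)
```

Feeding the trained map a vector for a house it has never seen returns the grid zone of similar houses, which is exactly the location-plus-characteristics categorization described above.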
This dimension-reduction procedure via principal components analysis, in which the vectors are orthogonalized, cuts out areas where information overlaps; the end result may be just a column or two of eigenvectors that account for more than 90% of the information. But even before principal components analysis is performed, preprocessing can be applied to make the data more digestible for the model, e.g. normalization, standardization, compression (such as to a logarithmic representation), filtering, extracting the trend, detrending, and so on. Make sure you carefully label which column represents which original input before you transform all this data into a mass of numbers.
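A minimal sketch of this step, assuming standardization as the preprocessing and a 90% variance-retention cutoff (both illustrative choices): the data is standardized column by column, the covariance eigenvectors are sorted, and only the leading orthogonal components that together explain the chosen share of variance are kept.

```python
import numpy as np

def pca_reduce(X, var_kept=0.9):
    """Standardize X, then project it onto the leading principal components
    (orthogonal eigenvectors of the covariance matrix) that together
    explain at least `var_kept` of the total variance."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # zero mean, unit variance
    cov = np.cov(Xs, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()     # cumulative variance explained
    k = int(np.searchsorted(ratio, var_kept)) + 1  # smallest k reaching the cutoff
    return Xs @ eigvecs[:, :k], ratio[k - 1]
```

On highly correlated inputs, the overlap is large and a single column of projected data can carry nearly all the information, which is the "column or two of eigenvectors" outcome described above.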
The main black box, superimposed on the SOM module, is not a categorization algorithm but a pattern-matching and prediction one. Back-propagation neural nets are more suitable for this part. Again, a training set and an out-of-sample set, containing data not seen before by the network, are required. How well this network predicts future house values will depend on how well it learns to 'generalize' to data it has not seen before. The problem of overtraining and overfitting degrading the predictive capability of a network is real. The principle of using as few neurons as possible to obtain convergence applies. The genetic algorithm portion will help overcome local optima.
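The training-set/out-of-sample-set discipline can be sketched as follows, assuming a single hidden layer, full-batch gradient descent, and early stopping on a held-out validation split (all illustrative choices; the split fraction, layer size, and patience are not from any real housing model). Early stopping is one simple guard against the overtraining problem described above.

```python
import numpy as np

def train_mlp(X, y, hidden=8, lr=0.1, max_epochs=3000, patience=100, seed=0):
    """One-hidden-layer back-propagation net for regression.
    Trains on 80% of the data and stops early when error on the
    held-out 20% (data the network never trains on) stops improving."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.8 * len(X))
    tr, va = idx[:cut], idx[cut:]                  # train / out-of-sample split
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1));          b2 = np.zeros(1)

    def forward(Xb):
        h = np.tanh(Xb @ W1 + b1)
        return h, h @ W2 + b2

    best_val, best_w, wait = np.inf, None, 0
    for epoch in range(max_epochs):
        h, out = forward(X[tr])
        err = out - y[tr].reshape(-1, 1)
        # back-propagate the squared-error gradient through both layers
        gW2 = h.T @ err / len(tr); gb2 = err.mean(0)
        dh = (err @ W2.T) * (1 - h ** 2)           # tanh derivative
        gW1 = X[tr].T @ dh / len(tr); gb1 = dh.mean(0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
        val = np.mean((forward(X[va])[1].ravel() - y[va]) ** 2)
        if val < best_val:
            best_val, best_w, wait = val, (W1.copy(), b1.copy(), W2.copy(), b2.copy()), 0
        else:
            wait += 1
            if wait > patience:                    # validation error stalled: stop
                break
    W1, b1, W2, b2 = best_w                        # keep the best-generalizing weights
    return lambda Xn: (np.tanh(Xn @ W1 + b1) @ W2 + b2).ravel()
```

The returned predictor uses the weights that performed best on the unseen split, not the weights at the final epoch; a genetic algorithm could additionally be wrapped around the weight initialization to search past local optima, as suggested above.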
Although semi-automated methods exist for choosing input variables and measuring their predictive significance, it is so much easier if this is done by someone with domain expertise. This is especially so when modeling housing markets, which differ not only from country to country but also from region to region within countries. I am not a real estate expert and thus cannot tell which of the input variables 1-8 will have more predictive value for the network. A real estate expert would save a lot of time in the initial 'sketch' of the model.