Oil Market Modeling Using Q-Learning

Michael Wunder
BOOM Project

1. Introduction
In today’s world the availability of oil to a country is fundamental to achieving a relatively high standard of living and socioeconomic level. The petroleum industry plays an enormous role in the financial market; however, it is becoming increasingly apparent that this natural resource is limited in supply. Business-as-usual oil usage will inevitably end the economic viability of typical oil production methods. This has serious implications for many industries worldwide. The transportation industry in particular will suffer greatly without a change in fuel source. As an example, China houses a population of 1.3 billion and its economy is growing rapidly, at a rate of 8% to 10% each year. Automobile sales grew by 70% last year. China is the second biggest oil user, after the United States. Even conservative experts have estimated that within 50 years the demand for petroleum will exceed oil producers’ ability to harvest sufficient amounts at reasonable prices.

The search for and production of oil is clearly a complex task, involving many factors. However, by using an artificial intelligence approach, we hoped to demonstrate a simplified model of the oil market. In the simulated oil market, the oil producers, or “agents”, are able to make simple decisions based on probabilistic knowledge gained from experience. An agent gains experience through repeated iterations through the “world”, where a reward is assigned if the agent takes a profitable action. The learning must be generalized so that agents perform well (i.e. increase profit) in all possible situations through intelligent decision making.

After experimentation with neural networks, we were able to achieve superior results by implementing a Q-learning algorithm. Reinforcement learning enabled the oil-producing agents to dynamically “learn” how to make better decisions.

Accurate modeling of the oil market has significant implications in today’s economic climate. Many investment organizations are using artificial intelligence methods to model the stock market. Accurate results on historical data imply that these models can be used to make predictions about the future movement of stocks. Refinement of this oil market simulation could prove useful to those involved with petroleum-related investment decisions.

2. Problem Definition & Algorithm
2.1 Task Definition
At its most basic level, the problem we are exploring is how to make agents learn correct actions in a state space, where actions are non-deterministic and feedback is given after every action. Agents are affected by the decisions of other agents and performance of individuals and the system as a whole is evaluated at the end of 100 turns (actions). Specifically, the agents are oil producers existing on a map of oil, distributed according to predetermined probabilities. The distribution does not change but the agents’ initial positions do. Agents earn money by building rigs on whichever square they happen to occupy. The amount of oil produced is determined by a simple formula:

This turn’s oil = (Total Oil on square) * (# of rigs on square) / 100.

The price received for this oil is determined by an inelastic demand curve.

P = 5 + (D / 2) / (S + 1)

where D = 24 and S = this turn’s oil produced by all agents.
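The two formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names `oil_this_turn` and `price` are our own and not part of the original simulation.

```python
def oil_this_turn(total_oil_on_square: float, rigs_on_square: int) -> float:
    """Oil produced on one square this turn: (total oil) * (# of rigs) / 100."""
    return total_oil_on_square * rigs_on_square / 100


def price(supply: float, demand: float = 24) -> float:
    """Price from the demand curve P = 5 + (D / 2) / (S + 1).

    `supply` is this turn's oil produced by all agents (S); D defaults to 24.
    """
    return 5 + (demand / 2) / (supply + 1)


if __name__ == "__main__":
    # One rig on a square holding 400 units of oil produces 4 units.
    produced = oil_this_turn(400, 1)
    # The price falls from 17 (no supply) toward 5 as supply grows.
    for s in (0, produced, 11):
        print(s, price(s))
```

With D = 24 the price is bounded between 5 (very large supply) and 17 (no supply), so each additional unit of oil lowers the price every agent receives.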

Thus the number of rigs on an agent’s square, the amount of oil on that square, and the price together make up the state of each agent. At the same time an agent receives payment for its oil produced, it incurs costs based on how many rigs it has bought previously. Each rig it currently owns costs five dollars, and each rig it has bought within twenty-five turns costs two additional dollars, even if it has sold some of those rigs. After selling a rig the agent receives a one-time reimbursement of twenty-five dollars. Each agent has seven actions at its disposal when on a square with no rigs. It can do nothing, build or sell one rig, and move in one of four directions. These directions are marked in order of probability. When an agent has built rigs it cannot move until it sells them all. At the end of a turn, the agent receives feedback in the form of revenues minus costs. The problem we are posing is: what is the best course of action to take at any given stage? Given information about current state, which action should be taken? No knowledge is assumed and the agent must learn how to proceed.
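As a rough sketch of the per-turn feedback described above, the following Python function computes revenues minus costs for one agent. The names are hypothetical, and two details are assumptions on our part: "within twenty-five turns" is read as fewer than 25 turns since purchase, and the sale reimbursement is counted in the same turn's feedback.

```python
RIG_UPKEEP = 5         # cost per rig currently owned, each turn
RECENT_SURCHARGE = 2   # extra cost per rig bought within the last 25 turns
RECENT_WINDOW = 25     # assumed: "recent" means fewer than 25 turns ago
SELL_REFUND = 25       # one-time reimbursement per rig sold


def turn_profit(oil_produced, price, rigs_owned, purchase_turns,
                current_turn, rigs_sold_this_turn=0):
    """Revenues minus costs for one agent in one turn.

    `purchase_turns` lists the turn on which every rig was bought,
    including rigs that have since been sold, because the surcharge
    still applies to them.
    """
    revenue = oil_produced * price
    recent = sum(1 for t in purchase_turns
                 if current_turn - t < RECENT_WINDOW)
    costs = RIG_UPKEEP * rigs_owned + RECENT_SURCHARGE * recent
    return revenue + SELL_REFUND * rigs_sold_this_turn - costs


if __name__ == "__main__":
    # A newly built rig producing 4 units at a price of 10: 40 - 7 = 33.
    print(turn_profit(4, 10, rigs_owned=1, purchase_turns=[0], current_turn=0))
    # Two recent rigs on a dry square lose 14 per turn.
    print(turn_profit(0, 8, rigs_owned=2, purchase_turns=[40, 41], current_turn=50))
```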

This problem is interesting from an artificial intelligence standpoint because there are too many states for a person to individually mark for the best action. In many cases there will be no correct answer, just an unknown probability value for the next state and this turn’s reward. The decisions to be made are simple ones, but only after many turns of feedback can they be reliably discerned. There are elements of risk analysis; every rig purchase is an inherently risky investment. Once that decision is made, the agent will have to endure costs for the next twenty-five turns regardless of what happens in the ensuing time period. We hypothesize that machine learning can help us with this task.

2.2 Algorithm Definition
Given the nature of this problem (a choice between actions with nondeterministic results, no transition model, and unclear knowledge of the reward function), it was important that we use the right learning algorithm. After consideration of several methods, we decided that Q-learning would be the most appropriate one to use. Because the structure of the system would give immediate rewards on every turn, there was no need for training examples in any form.

The table of Q/N values for each state-action pair was classified by four parameters: number of rigs on square, amount of oil on square, current price, and possible action. The number of rigs was counted from zero to fifteen. The amount of oil, which could rise as high as 1100, was broken into parts of 32. That is, if there were 319 units of oil on a square, it would count as ten. This gave 32 values for oil amount. Price ranged from zero to thirty-one, but usually was limited to about eight possibilities because of the demand function. Each rig/oil/price combination was further divided into eight actions: build, sell, nothing, and the four moves (and one blank). This gave us 131072 possible state-action pairs (2^17) which we store in an array of that size. Because each of these values is a power of two, this gave us a quasi-binary Q-value table. The same table is used by all agents.
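One way to realize such a table is to pack the four parameters into a single 17-bit index. The sketch below is an assumption about the layout, not the original implementation: the bit order, the clamping of out-of-range values, and the name `encode` are ours. It uses integer (floor) division for the oil bucket, which maps 400 units to bucket 12; under this assumption 319 units would fall in bucket 9, whereas rounding to the nearest bucket would give ten.

```python
OIL_BUCKET = 32                      # width of one oil bucket
N_RIGS, N_OIL, N_PRICE, N_ACTIONS = 16, 32, 32, 8
TABLE_SIZE = N_RIGS * N_OIL * N_PRICE * N_ACTIONS   # 2**17 = 131072


def encode(rigs: int, oil: float, price: float, action: int) -> int:
    """Pack (rigs, oil bucket, price, action) into one table index.

    Assumed layout: 4 bits of rigs, 5 bits of oil, 5 bits of price,
    3 bits of action. Out-of-range values are clamped to the top bucket.
    """
    rig_i = min(rigs, N_RIGS - 1)
    oil_i = min(int(oil) // OIL_BUCKET, N_OIL - 1)
    price_i = min(int(price), N_PRICE - 1)
    return (rig_i << 13) | (oil_i << 8) | (price_i << 3) | action


if __name__ == "__main__":
    q_table = [0.0] * TABLE_SIZE     # one Q-value per state-action pair
    n_table = [0] * TABLE_SIZE       # matching visit counts (N-values)
    idx = encode(rigs=0, oil=400, price=12, action=1)
    print(idx, q_table[idx], n_table[idx])
```

Because every dimension is a power of two, the index can be built with shifts and bitwise ORs, and the whole table fits in a flat array shared by all agents.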

Each element of the table contains both the N-value and the Q-value, which are updated whenever an action out of a state is chosen. The update function is

Q(a,s) = Q(a,s) + α (R(s) + γ max_a’ Q(a’,s’) – Q(a,s))

where α = 60/(59 + n) is the learning rate (n being the N-value of the state-action pair), R(s) = the cash income of the state s, and γ = 9/10 is the discount factor (to signify the importance of the time dependency in determining current action). To determine an action, the algorithm chooses the maximum F-value of the seven possible actions. The F-function returns 1000 for an action if it has been tried fewer than five times and the Q-value otherwise. Ties are broken at random. Thus the algorithm tries each action at least five times before it is guaranteed to choose the best Q-value among them.
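The update rule and the exploration function can be sketched as follows. This is a minimal illustration rather than the original code: the function names are hypothetical, and we assume that n is the visit count of the state-action pair at the moment of the update.

```python
import random

GAMMA = 0.9               # discount factor (9/10)
OPTIMISTIC_VALUE = 1000   # F-value for under-explored actions
MIN_TRIES = 5             # tries required before Q-values are trusted


def learning_rate(n: int) -> float:
    """Learning rate 60 / (59 + n), which decays as the pair is visited."""
    return 60 / (59 + n)


def f_value(q: float, n: int) -> float:
    """Exploration function: optimistic until an action has 5 tries."""
    return OPTIMISTIC_VALUE if n < MIN_TRIES else q


def choose_action(q_row, n_row, rng=random):
    """Pick the action with the highest F-value, breaking ties at random."""
    scores = [f_value(q, n) for q, n in zip(q_row, n_row)]
    best = max(scores)
    return rng.choice([a for a, s in enumerate(scores) if s == best])


def q_update(q_sa: float, n_sa: int, reward: float, max_q_next: float) -> float:
    """One Q-learning step for a single state-action pair."""
    alpha = learning_rate(n_sa)
    return q_sa + alpha * (reward + GAMMA * max_q_next - q_sa)


if __name__ == "__main__":
    # Action 2 has only been tried twice, so it is explored first.
    print(choose_action([10, 50, 20], [5, 5, 2]))
    print(q_update(q_sa=28, n_sa=4, reward=33, max_q_next=34))
```

The optimistic F-value takes the place of a separate exploration schedule: untried actions always look better than any learned Q-value until they have been sampled five times.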

To illustrate this algorithm in action, consider an agent that is in state 0-12-12. That is, there are 0 rigs on the square, there is about 400 oil, a relatively high amount (400/32 = 12), and the price is 12. Three actions have been taken fewer than five times: build, move to highest rank, move to lowest rank. These have the maximum F-value (1000), so one of them, here build, is selected at random. At the completion of the turn, the system finds that this agent has one rig on its square and finds the oil produced to be (1 rig * 400 oil)/100 = 4 oil per round. This additional oil in the supply lowers the universal price to 10, so when income is calculated, this agent receives 4*10 = 40 units of money and pays 7 per turn for the new rig, for a net reward of 33. In the next round the agent is in 1-12-10. In the update phase, the agent will access 0-12-12-1 to be updated (build is action 1). If the entry for this state-action pair is 28 with n = 4, and the maximum Q for state 1-12-10 is the build-value, 34, then the update equation gives Q(0-12-12-1) = 28 + 60/63 * (33 + 9/10*34 – 28) ≈ 62. Therefore, an agent now becomes much more likely to choose this action the next time it finds itself in this state.
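The arithmetic of this example can be checked step by step. The short script below simply recomputes the numbers from the text; the visit count n = 4 is inferred from the learning rate 60/63 and is otherwise an assumption.

```python
# Recompute the worked example for state-action pair 0-12-12-1 (build).
rigs, oil_on_square = 1, 400
produced = rigs * oil_on_square / 100        # 4 units of oil this round

new_price = 10                               # price after the extra supply
income = produced * new_price                # 40 units of money
rig_cost = 5 + 2                             # upkeep plus new-rig surcharge
reward = income - rig_cost                   # 33

n = 4                                        # assumed visit count, so alpha = 60/63
alpha = 60 / (59 + n)
gamma = 0.9

old_q = 28                                   # current entry for 0-12-12-1
max_q_next = 34                              # best Q-value in state 1-12-10

new_q = old_q + alpha * (reward + gamma * max_q_next - old_q)
print(round(new_q))                          # rounds to 62
```

Because the learning rate is still close to 1 at this early stage, a single profitable turn more than doubles the stored Q-value.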

The same principle works in reverse. An agent in 2-0-8 will find itself losing at least 14 every round by taking the do-nothing action. A build choice would worsen the problem, and this Q-value for build would steadily worsen every time it was chosen. However, the sell action would net the agent 25 – costs, and so on average over time agents would converge towards selling in this case.

3. Experimental Evaluation
3.1 Methodology
In order to determine the performance of the agents, we compared the results of the Q-learned oil market with results from a control market in which agents were instructed to perform certain actions based on equations that were developed using basic investment logic. These calculations considered conditions such as number of rigs, oil remaining on the current location, and the price of oil. Such a simplistic decision making structure is possible only because of the rudimentary nature of our oil market. In more complex systems, it would be much more difficult to develop successful governing equations for this rigid equation-based approach. In the Q-learning model, actions were determined using reinforcement learning.

Between these two decision making approaches, we compared agent cash holdings and total oil production. Average agent cash holdings were calculated to better demonstrate the effect of Q-learning on the shared Q value table. We determined that average cash holdings were a valid metric after observing consistent performance among all agents in the oil market. Agents are given random starting locations on the map. Repeated iterations through the oil market world enhanced intelligent decision making ability. A sufficient number of iterations was essential to optimize the Q value table, thereby optimizing performance. Several hundred iterations were required for convergence. Trials were performed using one agent, three agents, and ten agents. We compare the results in Section 3.2.

The Q-learning algorithm works because the decisions made update the Q value table with results over many iterations. Desirable results in this model would take the form of appropriate values for build/sell/move decisions in various states. For example, when agents are in states with low oil they should move to another square/state with higher reserves. In states with large amounts of oil and high prices, we would want to see them build a rig. This sequence would continue until the number of rigs becomes uneconomical. Eventually the oil would start to run out and it would make sense for the agents to sell off their rigs. This result is consistent with the Q value tables we saw produced in the model. Over time good decisions were rewarded and bad ones punished, and agents would converge on behaviors consistent with high profits.

3.2 Results
We found our learning algorithm to be most successful in the training of a three-agent oil market. Oil production approached and just exceeded that of the control market (Figure 1).

Meanwhile, the average cash holdings of all three agents surpassed those of the control market (Figure 2).

Results from trials with one agent demonstrate that a market containing just one agent performs nearly as well as the control market, in which all decisions are dictated by a series of equations, especially in average cash holdings (Figures 3 and 4).

Similar results are produced in a market containing ten agents (Figures 5 and 6).

3.3 Discussion
Best performance is observed with three agents, although increases in cash holdings and oil production occurred with the one-agent and ten-agent markets as well. In all three trials, we see convergence toward specific values. These convergence values vary between the one-, three-, and ten-agent markets. This variation is difficult to predict and is a consequence of the interplay between the amount of available oil and the number of agents pursuing that oil. Poorer performance in the one-agent market may indicate that there is insufficient competition for the Q-learning algorithm to optimize decision making. Similarly, the sub-optimal performance in a ten-agent market can be attributed to scarce resources and high competition as agents congregate towards concentrated oil reserves. Performance in a ten-agent market may improve with increased map size.

The graphs show how, in the best case, the Q-learning agents converge to or surpass the performance of the control market. This result is especially dramatic when comparing cash totals. Convergence is less pronounced in the oil production measure, as the control group was set up to maximize oil production while the learners are conditioned to produce more cash. This result shows somewhat superior performance from the learning market, as it achieves higher profits from less oil produced.

4. Related Work
While we found a fair amount of research on such topics as stock market prediction or pricing simulations, there was surprisingly little work done on the precise problem we looked at. That is, close examples of models of oil markets in which agents invested money to produce a return were scarce. However, we learned a good deal from “Dynamic Pricing by Software Agents” (Kephart, Hanson, and Greenwald 2000). This paper also investigates Q-learning algorithms, but with regard to price setting. There the agents are deciding what price to sell at in relation to their competitors. The state of the algorithm is the current prices of the competitors. The Q-function there represents the future-discounted profit. This characteristic is very close to our algorithm, with the key difference that their agents are choosing prices, while ours are choosing whether to invest on a given spot. The authors state that the agents in their model learning with Q would not converge on a set price. This conclusion is not directly relevant to our model because how the game turns out depends on initial conditions, which are random. The agents are always divergent from each other and sometimes the spread is quite large. What is important from our perspective is that the average converges. The fact that there is variation only means that some agents have locational advantages over others.

5. Future Work
There are a number of improvements we were not able to make in the interest of time. On the problem side, we would have liked to include more agent control over the price. In the current form, the price is decided by a demand formula. However, if pricing decisions were included in the agents’ choice of actions, we would expect to observe dramatic changes in overall market behavior. This addition would be simple and interesting, and there is no reason not to pursue it in future work on this topic. Overall, the system we made was very basic and can be expanded to include any number of different aspects of markets or multi-agent systems.

Given the ambiguous course of events resulting from any given action, it would be desirable to have a generalized function representing the best choice given some state. Because there are so many possible states, a function stored in a neural network or other structure would likely lead to good results. Again, this additional feature could result in changed behavior but we were not able to include it in the time provided.

6. Conclusion
In the end we found successful results using a Q-learning algorithm for this particular problem. Clear improvements over time from random behaviors to rational decision-making were observed. The problem we investigated was interesting from several perspectives. First, it is useful to explore possible applications of Q-learning and reinforcement learning in general. Second, exploration of economic agents can lead to better understanding of complex systems such as markets. If we can better understand why agents make certain decisions in certain situations, we can better predict future behaviors of these complex systems. Finally and most specifically, this model is useful because it simulates very basic behavior of agents in markets dealing with probabilistic distribution of scarce resources such as oil. This system in an extended form can also exhibit levels of risk in such environments as markets. Work in this area has barely begun, and it promises to be an exciting area of artificial intelligence in the future.