Oil Market Modeling Using Q-Learning
Michael Wunder
BOOM Project

1. Introduction

In today’s world, a country’s access to oil is fundamental to achieving a relatively high standard of living and socioeconomic level. The petroleum industry plays an enormous role in the financial market; however, it is becoming increasingly apparent that this natural resource is limited in supply. Business-as-usual oil usage will inevitably end the economic viability of typical oil production methods. This has serious implications for many industries worldwide. The transportation industry will suffer greatly without a change in fuel source. As an example, China has a population of 1.3 billion and its economy is growing rapidly, at a rate of 8% to 10% each year. Automobile sales grew by 70% last year. China is the second biggest oil user, after the United States. Even conservative experts have estimated that within 50 years the demand for petroleum will exceed oil producers’ ability to harvest sufficient amounts at reasonable prices.

The search for and production of oil is clearly a complex task involving many factors. However, by using an artificial intelligence approach, we hoped to demonstrate a simplified model of the oil market. In the simulated oil market, the oil producers, or “agents”, are able to make simple decisions based on probabilistic knowledge gained from experience. An agent gains experience through repeated iterations through the “world”, where a reward is assigned if the agent takes a profitable action. The learning must be generalized so that agents perform well (i.e. increase profit) in all possible situations through intelligent decision making. After experimentation with neural networks, we were able to achieve superior results by implementing a Q-learning algorithm. Reinforcement learning enabled the oil-producing agents to dynamically “learn” how to make better decisions.

Accurate modeling of the oil market has significant implications in today’s economic climate. Many investment organizations are using artificial intelligence methods to model the stock market. Accurate results on historical data imply that these models can be used to make predictions about the future movement of stocks. Refinement of this oil market simulation could prove useful to those involved with petroleum-related investment decisions.
2. Problem Definition & Algorithm

This turn’s oil = (Total Oil on square) * (# of rigs on square) / 100.

The price received for this oil is determined by an inelastic demand curve:

P = 5 + (D / 2) / (S + 1)

where D = 24 and S = this turn’s oil produced by all agents. Thus the number of rigs on an agent’s square, the amount of oil on that square, and the price together make up the state of each agent.

At the same time as an agent receives payment for the oil it produced, it incurs costs based on how many rigs it has bought previously. Each rig it currently owns costs five dollars, and each rig it has bought within the last twenty-five turns costs two additional dollars, even if it has since sold some of those rigs. After selling a rig the agent receives a one-time reimbursement of twenty-five dollars.

Each agent has seven actions at its disposal when on a square with no rigs. It can do nothing, build or sell one rig, or move in one of four directions. These directions are marked in order of probability. Once an agent has built rigs, it cannot move until it sells them all. At the end of a turn, the agent receives feedback in the form of revenues minus costs.

The problem we are posing is: what is the best course of action to take at any given stage? Given information about the current state, which action should be taken? No prior knowledge is assumed, and the agent must learn how to proceed. This problem is interesting from an artificial intelligence standpoint because there are too many states for a person to mark each one with its best action. In many cases there will be no correct answer, just an unknown probability value for the next state and this turn’s reward. The decisions to be made are simple ones, but they can be reliably discerned only after many turns of feedback. There are elements of risk analysis: every rig purchase is an inherently risky investment. Once that decision is made, the agent must bear the costs for the next twenty-five turns regardless of what happens in the ensuing time period. We hypothesize that machine learning can help us with this task.
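The per-turn economics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the function names are hypothetical, the formulas are taken as printed in the text, and we assume the twenty-five dollar reimbursement is credited in the same turn as the sale.

```python
D = 24  # demand constant given in the text


def oil_produced(total_oil, rigs):
    """This turn's oil from one square: total oil * number of rigs / 100."""
    return total_oil * rigs / 100


def price(total_supply, demand=D):
    """Price from the demand curve P = 5 + (D / 2) / (S + 1)."""
    return 5 + (demand / 2) / (total_supply + 1)


def rig_costs(rigs_owned, rigs_bought_recently):
    """5 per rig currently owned, plus 2 per rig bought within the last
    twenty-five turns (charged even if that rig has since been sold)."""
    return 5 * rigs_owned + 2 * rigs_bought_recently


def turn_feedback(total_oil, rigs_owned, rigs_bought_recently,
                  total_supply, rigs_sold=0):
    """Revenues minus costs for one agent in one turn.

    Assumption: the one-time reimbursement of 25 per sold rig is
    counted in the turn in which the rig is sold.
    """
    revenue = oil_produced(total_oil, rigs_owned) * price(total_supply)
    reimbursement = 25 * rigs_sold
    return revenue + reimbursement - rig_costs(rigs_owned, rigs_bought_recently)
```

For instance, `oil_produced(400, 1)` gives 4 units of oil, and `rig_costs(1, 1)` gives the 7 per turn that a newly built rig costs.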
2.2 Algorithm Definition

The table of Q/N values for each state-action pair is indexed by four parameters: number of rigs on the square, amount of oil on the square, current price, and action. The number of rigs was counted from zero to fifteen. The amount of oil, which could rise as high as 1100, was broken into parts of 32. That is, if there were 319 units of oil on a square, it would count as ten. This gave 32 values for oil amount. Price ranged from zero to thirty-one, but usually was limited to about eight possibilities because of the demand function. Each rig/oil/price combination was further divided into eight action slots: build, sell, nothing, the four moves, and one blank. This gave us 16 × 32 × 32 × 8 = 131072 possible state-action pairs (2^17), which we store in an array of that size. Because each of these dimensions is a power of two, this gave us a quasi-binary Q-value table. The same table is used by all agents.

Each element of the table contains both the N-value and the Q-value, which are updated whenever an action out of a state is chosen. The update function is

$$Q(a,s) = Q(a,s) + \alpha \left( R(s) + \gamma \max_{a'} Q(a',s') - Q(a,s) \right)$$

where $\alpha = 60/(59 + n)$ with $n$ the N-value of the pair, $R(s)$ is the cash income of the state $s$, and $\gamma = 9/10$ (to signify the importance of the time dependency in determining the current action).

To determine an action, the algorithm chooses the maximum F-value of the seven possible actions. The F-function returns 1000 for an action if it has been tried fewer than five times, and the Q-value otherwise. Ties are broken at random. Thus the algorithm tries each action at least five times before it is guaranteed to choose the best Q-value among them.

To illustrate this algorithm in action, consider an agent that is in state 0-12-12. That is, there are 0 rigs on the square, there is about 400 oil, a relatively high amount (400/32 = 12), and the price is 12. Three actions have been taken fewer than five times: build, move to highest rank, and move to lowest rank. These share the maximum F-value (1000), so one of them is selected at random; suppose it is build. At the completion of the turn, the system finds that this agent has one rig on its square and finds the oil produced to be (1 rig * 400 oil) / 100 = 4 oil per round. This additional oil in the supply lowers the universal price to 10, so when income is calculated, this agent receives 4 * 10 = 40 units of money and pays 7 per turn for the new rig, for a net reward of 40 - 7 = 33. In the next round the agent is in 1-12-10.

In the update phase, the agent will access 0-12-12-1 to be updated (build is action 1). If the entry for this state-action pair is 28, and the maximum Q for state 1-12-10 is the build value, 34, then we can evaluate the update equation:

$$Q(\text{0-12-12-1}) = 28 + \tfrac{60}{63}\left(33 + \tfrac{9}{10} \cdot 34 - 28\right) \approx 62$$

Therefore, the agent becomes much more likely to choose this action the next time it finds itself in this state.

The same principle works in reverse. An agent in 2-0-8 will find itself losing at least 14 every round by taking the do-nothing action. A build choice would worsen the problem, and the Q-value for build would steadily worsen every time it was chosen. However, the sell action would net the agent 25 minus costs, and so on average over time agents would converge towards selling in this case.
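The table layout, the F-function, and the update rule described above can be sketched together in Python. This is a minimal illustration under stated assumptions, not the original implementation: the class and function names are hypothetical, only "build is action 1" comes from the text (the other action codes are invented for the example), and we assume the N-value is incremented before the learning rate is computed, so that a pair tried for the fourth time uses 60/63 as in the worked example.

```python
import random

N_RIGS, N_OIL_BINS, N_PRICES, N_ACTIONS = 16, 32, 32, 8  # 2^4 * 2^5 * 2^5 * 2^3
GAMMA = 0.9            # discount factor, 9/10 in the text
UNTRIED_VALUE = 1000.0  # F-value of an action tried fewer than MIN_TRIES times
MIN_TRIES = 5

# Only BUILD = 1 is given in the text; the other codes are assumptions.
NOTHING, BUILD, SELL, MOVE_1, MOVE_2, MOVE_3, MOVE_4 = range(7)


def table_index(rigs, oil_bin, price, action):
    """Pack the four discretized parameters into one 17-bit array index."""
    return ((rigs * N_OIL_BINS + oil_bin) * N_PRICES + price) * N_ACTIONS + action


class QTable:
    """Q-values and N-values for every state-action pair, shared by all agents."""

    def __init__(self):
        size = N_RIGS * N_OIL_BINS * N_PRICES * N_ACTIONS  # 131072
        self.q = [0.0] * size
        self.n = [0] * size

    def f_value(self, state, action):
        """Return 1000 for under-explored actions, the Q-value otherwise."""
        idx = table_index(*state, action)
        return UNTRIED_VALUE if self.n[idx] < MIN_TRIES else self.q[idx]

    def choose(self, state, actions, rng=random):
        """Pick the action with the largest F-value; ties are broken at random."""
        scores = {a: self.f_value(state, a) for a in actions}
        best = max(scores.values())
        return rng.choice([a for a in actions if scores[a] == best])

    def update(self, state, action, reward, next_state, next_actions):
        """Apply Q = Q + alpha * (R + gamma * max Q' - Q), alpha = 60 / (59 + n)."""
        idx = table_index(*state, action)
        self.n[idx] += 1  # assumption: count this try before computing alpha
        alpha = 60 / (59 + self.n[idx])
        best_next = max(self.q[table_index(*next_state, a)] for a in next_actions)
        self.q[idx] += alpha * (reward + GAMMA * best_next - self.q[idx])
        return self.q[idx]


# Reproduce the worked example: state 0-12-12, action build, reward 33.
table = QTable()
start, nxt = (0, 12, 12), (1, 12, 10)
table.q[table_index(*start, BUILD)] = 28.0
table.n[table_index(*start, BUILD)] = 3   # this update is the fourth try
table.q[table_index(*nxt, BUILD)] = 34.0
new_q = table.update(start, BUILD, 33.0, nxt, [NOTHING, BUILD, SELL])
print(round(new_q))  # 62
```

Packing the indices this way is one reading of the "quasi-binary" table: because every dimension is a power of two, the multiplications are equivalent to bit shifts, and the largest index is exactly 131071.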
3. Experimental Evaluation
3.2 Results
We found our learning algorithm to be most successful in training a three-agent oil market. Oil production approached and then just exceeded that of the control market.
3.3 Discussion
4. Related Work
5. Future Work

Given the ambiguous course of events resulting from any given action, it would be desirable to have a generalized function representing the best choice given some state. Because there are so many possible states, a function stored in a neural network or other structure would likely lead to good results. Again, this additional feature could result in changed behavior but we were not able to include it in the time provided.
6. Conclusion