Phy-Q: A Benchmark for Physical Reasoning

Cheng Xue*, Vimukthini Pinto*, Chathura Gamage*
Ekaterina Nikonova, Peng Zhang, Jochen Renz
School of Computing
The Australian National University
Canberra, Australia
{cheng.xue, vimukthini.inguruwattage, chathura.gamage}@anu.edu.au
{ekaterina.nikonova, p.zhang, jochen.renz}@anu.edu.au

Humans are well-versed in reasoning about the behaviors of physical objects when choosing actions to accomplish tasks, while it remains a major challenge for AI. To facilitate research addressing this problem, we propose a new benchmark that requires an agent to reason about physical scenarios and take an action accordingly. Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, we identify 15 essential physical scenarios. For each scenario, we create a wide variety of distinct task templates, and we ensure all the task templates within the same scenario can be solved by using one specific physical rule. By having such a design, we evaluate two distinct levels of generalization, namely the local generalization and the broad generalization. We conduct an extensive evaluation with human players, learning agents with varying input types and architectures, and heuristic agents with different strategies. The benchmark gives a Phy-Q (physical reasoning quotient) score that reflects the physical reasoning ability of the agents. Our evaluation shows that 1) all agents fail to reach human performance, and 2) learning agents, even with good local generalization ability, struggle to learn the underlying physical reasoning rules and fail to generalize broadly. We encourage the development of intelligent agents with broad generalization abilities in physical domains.

* equal contribution

The research paper can be found here: https://arxiv.org/abs/2108.13696


Table of contents

  1. Physical Scenarios in Phy-Q
  2. Phy-Q in Angry Birds
  3. Task Generator
  4. Tasks Generated for the Baseline Analysis
  5. Baseline Agents
    1. How to Run Heuristic Agents
    2. How to Run Learning Agents
      1. How to Run DQN Baseline
      2. How to Run Stable Baselines
    3. How to Develop Your Own Agent
    4. Outline of the Agent Code
  6. Framework
    1. The Game Environment
    2. Symbolic Representation Data Structure
    3. Communication Protocols
  7. Human Player Data


1. Physical Scenarios in Phy-Q

We consider 15 physical scenarios in the Phy-Q benchmark. First, we consider the basic physical scenarios associated with applying forces directly to the target objects, i.e., the effect of a single force and the effect of multiple forces. Beyond the application of simple forces, we also include scenarios associated with more complex motion, including rolling, falling, sliding, and bouncing, which are inspired by the physical reasoning capabilities developed in human infancy. Furthermore, we define the objects’ relative weight, relative height, relative width, shape difference, and stability scenarios, which require physical reasoning abilities that infants typically acquire at a later stage. Finally, we incorporate clearing paths, adequate timing, manoeuvring, and taking non-greedy actions, which robots need in order to work safely and efficiently in physical environments. To sum up, the physical scenarios we consider and the corresponding physical rules that can be used to achieve the goal of the associated tasks are:

  1. Single force: Some target objects can be destroyed with a single force.
  2. Multiple forces: Some target objects need multiple forces to be destroyed.
  3. Rolling: Circular objects can be rolled along a surface to a target.
  4. Falling: Objects can be made to fall onto a target.
  5. Sliding: Non-circular objects can be slid along a surface to a target.
  6. Bouncing: Objects can be bounced off a surface to reach a target.
  7. Relative weight: Objects with the correct weight need to be moved to reach a target.
  8. Relative height: Objects with the correct height need to be moved to reach a target.
  9. Relative width: Objects with the correct width, or an opening with the correct width, should be selected to reach a target.
  10. Shape difference: Objects with the correct shape need to be moved or destroyed to reach a target.
  11. Non-greedy actions: Actions need to be selected in the correct order based on their physical consequences. The immediate action may be less effective in the short term but advantageous in the long term, e.g., reaching fewer targets in the short term to reach more targets later.
  12. Structural analysis: The correct target needs to be chosen to break the stability of a structure.
  13. Clearing paths: A path needs to be created before the target can be reached.
  14. Adequate timing: Correct actions need to be performed within time constraints.
  15. Manoeuvring: Powers of objects need to be activated correctly to reach a target.
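
For scripts that need to refer to scenarios by index, the list above can be captured as a small lookup table. This is a minimal sketch: the names `SCENARIOS` and `scenario_name` are illustrative assumptions and are not taken from the Phy-Q code base.

```python
# Illustrative lookup table for the 15 Phy-Q scenarios.
# Indices follow the numbered list above. SCENARIOS and scenario_name are
# hypothetical names for this sketch, not part of the Phy-Q repository.
SCENARIOS = {
    1: "single force",
    2: "multiple forces",
    3: "rolling",
    4: "falling",
    5: "sliding",
    6: "bouncing",
    7: "relative weight",
    8: "relative height",
    9: "relative width",
    10: "shape difference",
    11: "non-greedy actions",
    12: "structural analysis",
    13: "clearing paths",
    14: "adequate timing",
    15: "manoeuvring",
}


def scenario_name(index: int) -> str:
    """Return the scenario name for a 1-based scenario index."""
    if index not in SCENARIOS:
        raise ValueError(f"scenario index must be in 1..15, got {index}")
    return SCENARIOS[index]


if __name__ == "__main__":
    for idx in sorted(SCENARIOS):
        print(idx, scenario_name(idx))
```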

2. Phy-Q in Angry Birds

Based on the above physical scenarios, we develop the Phy-Q benchmark in Angry Birds. Phy-Q contains tasks from 75 task templates belonging to the fifteen scenarios. The goal of an agent is to destroy all the pigs (green-coloured objects) in a task by shooting a given number of birds from the slingshot. Shown below are fifteen example tasks in Phy-Q representing the fifteen scenarios and the solutions for those tasks.

Task Description
1. Single force: A single force needs to be applied to the pig, by a direct bird shot, to destroy it.
2. Multiple forces: Multiple forces need to be applied to the pig, through multiple bird shots, to destroy it.
3. Rolling: The circular object needs to be rolled onto the pig, which the bird cannot reach from the slingshot, causing the pig to be destroyed.
4. Falling: The circular object needs to be made to fall onto the pig, causing the pig to be destroyed.
5. Sliding: The square object needs to be slid to hit the pig, which the bird cannot reach from the slingshot, causing the pig to be destroyed.
6. Bouncing: The bird needs to be bounced off the platform (dark-brown object) to hit and destroy the pig.
7. Relative weight: The small circular block is lighter than the big circular block. Of the two blocks, only the small circular block can be rolled to reach and destroy the pig.
8. Relative height: The square block on top of the taller rectangular block will not fall through the gap due to the height of the rectangular block. Hence, the square block on top of the shorter rectangular block needs to be toppled to fall through the gap and destroy the pig.
9. Relative width: The bird cannot go through the lower entrance, which has a narrow opening. Hence, the bird needs to be shot at the upper entrance to reach the pig and destroy it.
10. Shape difference: The circular block on two triangle blocks can be rolled down by breaking one triangle block, whereas the circular block on two square blocks cannot be rolled down by breaking one square block. Hence, the triangle block needs to be destroyed to make the circular block roll and fall onto the pig, causing the pig to be destroyed.
11. Non-greedy actions: A greedy action tries to destroy the highest number of pigs in a single bird shot. If the two pigs resting on the circular block are destroyed, the circular block will roll down and block the entrance to the pig below. Hence, the lower pig needs to be destroyed first, and then the upper two pigs.
12. Structural analysis: The bird needs to be shot at the weak point of the structure to break its stability and destroy the pigs. Shooting elsewhere does not destroy the two pigs with a single bird shot.
13. Clearing paths: First, the rectangle block needs to be positioned correctly to open the path for the circular block to reach the pig. Then the circular block needs to be rolled to destroy the pig.
14. Adequate timing: First, the two circular objects need to be rolled to the ramp. Then, after the first circle passes the prop and before the second circle reaches the prop, the prop needs to be destroyed to make the second circle fall onto the lower pig.
15. Manoeuvring: The blue bird splits into three birds when it is tapped in flight. The blue bird needs to be tapped at the correct position to manoeuvre the birds to reach the two separated pigs.

Screenshots of the 75 task templates are shown below. x.y represents the yth template of the xth scenario. The indexes of the scenarios are: 1. single force, 2. multiple forces, 3. rolling, 4. falling, 5. sliding, 6. bouncing, 7. relative weight, 8. relative height, 9. relative width, 10. shape difference, 11. non-greedy actions, 12. structural analysis, 13. clearing paths, 14. adequate timing, and 15. manoeuvring:

1.1 1.2 1.3
1.4 1.5 2.1
2.2 2.3 2.4
2.5 3.1 3.2
3.3 3.4 3.5
3.6 4.1 4.2
4.3 4.4 4.5
5.1 5.2 5.3
5.4 5.5 6.1
6.2 6.3 6.4
6.5 6.6 7.1
7.2 7.3 7.4
7.5 8.1 8.2
8.3 8.4 9.1
9.2 9.3 9.4
10.1 10.2 10.3
10.4 11.1 11.2
11.3 11.4 11.5
12.1 12.2 12.3
12.4 12.5 12.6
13.1 13.2 13.3
13.4 13.5 14.1
14.2 15.1 15.2
15.3 15.4 15.5
15.6 15.7 15.8
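
The x.y naming scheme can be handled programmatically. The sketch below records the number of templates per scenario as read off the grid above, checks that they add up to 75, and parses an identifier such as `12.6` into its scenario and template indices. The names `TEMPLATES_PER_SCENARIO`, `parse_template_id`, and `all_template_ids` are assumptions made for illustration and do not come from the Phy-Q repository.

```python
from typing import List, Tuple

# Number of templates in each scenario, read off the template grid above
# (scenario index x -> largest y that appears in an "x.y" label).
# All names in this sketch are hypothetical, not part of the Phy-Q code base.
TEMPLATES_PER_SCENARIO = {
    1: 5, 2: 5, 3: 6, 4: 5, 5: 5,
    6: 6, 7: 5, 8: 4, 9: 4, 10: 4,
    11: 5, 12: 6, 13: 5, 14: 2, 15: 8,
}


def parse_template_id(template_id: str) -> Tuple[int, int]:
    """Split an 'x.y' label into (scenario index, template index)."""
    scenario_str, template_str = template_id.split(".")
    scenario, template = int(scenario_str), int(template_str)
    if scenario not in TEMPLATES_PER_SCENARIO:
        raise ValueError(f"unknown scenario index: {scenario}")
    if not 1 <= template <= TEMPLATES_PER_SCENARIO[scenario]:
        raise ValueError(f"scenario {scenario} has no template {template}")
    return scenario, template


def all_template_ids() -> List[str]:
    """List every template label in scenario order, e.g. '1.1', '1.2', ..."""
    return [
        f"{scenario}.{template}"
        for scenario, count in sorted(TEMPLATES_PER_SCENARIO.items())
        for template in range(1, count + 1)
    ]


if __name__ == "__main__":
    print(sum(TEMPLATES_PER_SCENARIO.values()))  # total number of templates
    print(parse_template_id("12.6"))
```

Keeping the per-scenario counts in one table makes it easy to validate identifiers before they are used to look up template files.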

3. Task Generator

We develop a task generator that can generate tasks for the task templates we designed for each scenario.

  1. To run the task generator:
    1. Go to tasks/task_generator
    2. Copy the task templates that you want to generate tasks for into the input (level templates can be found in tasks/task_templates)
    3. Run the task generator providing the number of tasks as an argument

    <div class="snippet-clipboard-content position-relative" data-snippet-clipboard-copy-content=" python generate_tasks.py
    ">

       python generate_tasks.py