Final Review · Group 6 — Identifying AI Risks for Non-Human Life in Urban Spaces

Identifying AI Risks for Non-Human Life in Urban Spaces · Final Review · Score 30/30

Tackles a real and underexplored gap: systematic mapping of AI harms to non-human urban organisms, distinct from both anthropocentric AI-safety work and purely theoretical ecological frameworks.

The Log-Dampened Poisson sampling is a thoughtful, principled correction for citizen-observer bias, preferable to arbitrary quotas, and the eight-city selection is justified along four explicit diversity axes.

The three-step pipeline (generative elicitation → deterministic high-risk filter → counterfactual CQ1/CQ2 audit) is clearly described and reproducible, and the prompts are fully disclosed in the appendix.

Grounding risk categories in the established Coghlan framework and cross-referencing against EU AI Act tiers (Regulatory Blind Spot Index) gives external theoretical anchoring and a policy-relevant punchline.

Good transparency: open-sourced data, taxonomy, scripts, and labelled matrices; rich appendices with domain-level and trait-level breakdowns.

The structured-vs-naive ablation (3.50 vs 3.05) at least attempts to show the schema adds value.

−

The pipeline has no ground truth: risks are generated by an LLM and validated only by LLMs and the authors, so the entire output is a model's hypothesis about ecology, never checked against field data or independent expert judgment.

−

The human evaluation is underpowered and weak: n=30 pairs, Exact Agreement of 31.7%, Cohen's κ=0.375 (only “fair”), and a +14.8% improvement reported with no significance test or confidence interval. Both methods also score only moderately in absolute terms (3.05 and 3.50 out of 5).

−

Only the Social Desirability dimension was human-validated (50 organisms, 88% match); the other nine dimensions that drive every downstream result were assigned by GPT-5-mini with no reported validation.

−

Model sprawl (GPT-5-mini, Gemini Pro 3.1, GPT-4.1) is unjustified: the choice of model per step is not explained and cross-model consistency is untested, raising reproducibility concerns.

−

The headline Regulatory Blind Spot inversion is fragile: the “Prohibited” tier rests on n=2, and the severity gap elsewhere (2.82 vs 2.57) is small, with no measure of dispersion reported.

−

Multiple typos and a slightly rushed feel (“colelcted”, “Occurence”, “Wsing”, “the a diverse list”).
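
To make the statistical concerns above concrete, here is a minimal sketch of the two checks the human evaluation lacks: an unweighted Cohen's κ for two raters, and a percentile bootstrap confidence interval for the mean paired score difference between the structured and naive methods. The function names, the resampling settings, and the example ratings are all hypothetical; the authors' raw ratings were not available to this reviewer, so nothing here reproduces their numbers.

```python
import random
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a - b).

    Returns (mean_difference, lower_bound, upper_bound).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("scores must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


if __name__ == "__main__":
    # Hypothetical 1-5 ratings for four item pairs; illustrative only.
    structured = [4, 4, 3, 5]
    naive = [3, 4, 3, 4]
    print(paired_bootstrap_ci(structured, naive))
    # Hypothetical binary labels from two raters.
    print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

If the interval for the paired difference includes zero at n=30, the +14.8% improvement cannot be distinguished from rater noise. Reporting this interval alongside κ would let readers judge that directly.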

Review Nº 06

The Pros

The Cons