Ad hoc and general-purpose corpus construction from web sources