Thinking about building your own in-house text analytics system? Here are some possible pitfalls to consider.

I am often told by Housing Association prospects that they understand the value of analysing free-form data using text analytics and are thinking about developing their own system in-house, using available tools such as Microsoft fuzzy lookup and Python. Over the years we have seen a few organisations try this and then return to us once they realise the significance of what they are taking on. Don’t get me wrong: it is possible to achieve some level of accuracy and results from an in-house developed product, but I have listed below some of the challenges that need to be considered.


  1. The processing power required to do the initial labelling of words and phrases is considerable and has been known to crash systems. You will therefore need to take advantage of cloud processing from (say) AWS or Azure, which comes with its own cost and management overhead.
  2. Labelling the words and phrases is actually the simple part; ordering them into topic groups that make sense to the business is trickier. Which topics are important? How should they be categorised? We have, for instance, built out 472 Housing Sector specific topics by working with other HAs, so we know these are relevant and important to the sector.
  3. Applying a sentiment to the sentences. A sentence like “I couldn’t recommend you to my friends and family” has a very negative connotation, until “more highly” is added to the end: “I couldn’t recommend you to my friends and family more highly” is very positive. Getting a contextual sentiment score is much more difficult than most imagine.
  4. Contextual sentiment needs to account for things like slang, idioms, colloquialisms and sarcasm, e.g. “great job XXXXX….. NOT”. This is actually very negative, but we see some classing it as a positive because it says “great job”. Admittedly we see less of this specific sarcasm now, but there was quite a period where it was very common. Sarcasm changes over time, so the solution needs to take that into account, and most often that needs human intervention.
  5. Problem words, e.g. “outstanding”: in one context this is extremely positive, in another it is extremely negative. Machines need to make an allowance for this.
  6. Training your AI requires an immense amount of data: not only data to process and see what the results look like, but also “correctly” classified data to check the results against. Invariably, this means there is a heavy human element to building the solution and refining the algorithms.
  7. Outlying words may present a challenge for AI. For example, with our Health & Safety solution we pre-code all the cancer conditions, all the cancer treatments and all the numerous ways these can be misspelt. We also pre-code all the 1,000+ biocide products authorised for use in the UK, and many other topics are covered in the same way. This kind of coding is not just coding; it is also the research that goes into looking up all these words, phrases and names before you can even get to the coding part.
  8. Outlying phrases also present a challenge to AI. For example, the phrase “knocked-up” has been seen to mean a “knocked-up worktop or window” (repairs) or to refer to a pregnancy, as in “and my missus is knocked-up”. There is an enormous number of phrases just like this, which we know all about having spent 20+ years doing this.
  9. Language and situations change, so the system requires constant maintenance. Take a topical word such as Coronavirus: we now see 84 different ways this one topic / word is spelt and misspelt.
  10. Constant refinement is required. Despite having built a system over the past 20 years, we are still amazed at the constant changes. As it is an ever-moving target, we employ a team of “lexicon” agents (here in the UK) to find the anomalies and constantly code new words and phrases. Machines are not good at this. For example, when Covid started in 2020 we had never seen a mention of it; by the end of February we started to see references to it. Because we have people who go home at night and saw the terrible news reports coming in from Italy at the time, we immediately started coding and, before the end of February, were already reporting on this to our clients. Remember, lockdown did not commence until 23rd March in the UK, so our customers were already getting insights before they even conceived there might be a lasting impact on their organisations from the pandemic.
  11. Specific data manipulation: assuming your data teams can manipulate the data sufficiently to give a generic feel, they now need to build out rules for specific use cases. In our HHSRS (resident Health & Safety) solution, we look for a combination of topics, e.g. excess cold AND elderly, cancer patients, cardio issues, mental health issues, children etc. A mention of excess cold can be as random as “thermostat broken”. To give the HHSRS views we needed to code 1,672 topics, each of which has hundreds or thousands of associated words and phrases (along with misspellings), which turns into hundreds of millions or even billions of lookups. In this way, we ensure accuracy.
  12. Reporting: once you have all the data labelled and segmented by topic and sentiment, what views make sense to the various stakeholders across the business? The repairs team want different views to the customer service team, and different again to the Health & Safety team; wellbeing only want to hear from vulnerable customers; and the board will want to see how the outcomes and initiatives align to the company values. There will be many other requirements besides, which need to be built, supported and maintained. We have the vast majority of those off-the-shelf, ready to go.
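
To make points 3 to 5 concrete, here is a minimal Python sketch of the kind of keyword scoring an in-house team might start with. It is an illustration only, not how our system works: the tiny word lists, the function names and the sentence “My repair is still outstanding” are all invented for this example. The first scorer simply counts positive and negative words; the second adds the obvious fix of flipping the result whenever a negating word appears.

```python
import re

# Toy lexicon, invented for this sketch; a real one would be far larger.
POSITIVE = {"great", "outstanding", "recommend"}
NEGATIVE = {"terrible", "rude"}
NEGATORS = {"not", "couldn't", "never"}


def tokenize(text):
    """Lower-case the text, normalise curly apostrophes and split into words."""
    return re.findall(r"[a-z']+", text.lower().replace("\u2019", "'"))


def label(score):
    """Turn a numeric score into a sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def keyword_sentiment(text):
    """Count positive and negative keywords with no regard for context."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return label(score)


def negation_aware_sentiment(text):
    """Same keyword count, but flip the sign if any negator appears."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if any(t in NEGATORS for t in tokens):
        score = -score
    return label(score)


EXAMPLES = [
    "I couldn't recommend you to my friends and family",
    "I couldn't recommend you to my friends and family more highly",
    "great job ..... NOT",
    "My repair is still outstanding",  # invented example sentence
]

if __name__ == "__main__":
    for text in EXAMPLES:
        print(f"{keyword_sentiment(text):>8} | {negation_aware_sentiment(text):>8} | {text}")
```

Plain keyword counting calls all four sentences positive, so it misses the negative recommendation, the sarcasm and the overdue repair. Adding the negation rule fixes the first and third sentences but then marks the “more highly” sentence as negative, and it still reads “outstanding” as praise. Each patch tends to fix one case and break another, which is why contextual sentiment takes far more than a word list.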


If you would like to chat more about using text analytics for your housing association resident feedback contact us on

Ian Dean, Sales Director