Analytics, visualization and algorithm challenges focus on finding better ways to interpret or communicate data. The federal government has been working to increase data sharing and make data sets more open and easy to combine. This increases the potential value of government information, and challenges are a powerful way to realize that potential.
Challenges address these goals in many ways:
- They reward exploration of newly available data sets and the most effective uses or combinations of information (e.g., EPA Apps for the Environment Challenge).
- They ask for visual design that helps people better understand complex relationships.
- They find the best way to answer a specific question given specific data.
These challenges attract solvers intrigued by a specific topic and those looking for new ways to apply their statistical or design expertise. As you put together your challenge, you may want to find ways to reach out to both of these communities or focus on one of them. There are several challenge platforms devoted exclusively to statistical analysis, machine learning and coding. These platforms specialize in finding fast solutions to difficult analytic problems.
Analytics, visualization and algorithm challenges are typically, but not exclusively, posed to communities of solvers that specialize in developing software algorithms, “big-data” analytics and data visualizations. Many of these challenges use large and often unstructured data sets, or “big data.” These big-data challenges are used to develop algorithms that are predictive, algorithms that can detect and discover complex patterns or visualizations that effectively convey information extracted from the data. These large data sets can consist of anything, including historical system performance data, scientific data and special or time-lapsed imagery. With modern data collection and storage systems, these data sets can be enormous. These algorithms include traditional algorithmic approaches as well as newer machine-learning techniques. In some areas, machine-learning techniques are combined in sequential processes with crowd-based analytics, in which crowdsourcing techniques employ humans to provide an analytic step that is uniquely served by human intelligence. These new “chained” processes are being used to allow for much more complex and efficient processing of big data than some traditional methods.
For most algorithm challenges, the challenge development phase involves much more than simply developing the challenge statement. In most cases, you provide a platform where solvers can submit their solutions to be tested and scored. This includes hosting portions of the data that the competitors need to test their algorithms. The scoring function for your challenge must itself be testable and must reflect your priorities in solving the problem, such as the relative weight given to performance, accuracy and robustness. Normally, scores are posted during the challenge so that the competitors understand the current “score to beat.”
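As a rough illustration of a scoring function that encodes priorities, the sketch below combines accuracy, speed and robustness into one leaderboard score. The weights, the time limit, the `Submission` fields and the reading of "performance" as runtime are all assumptions made for this example; a real challenge would set them with technical experts and the platform provider.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    solver: str
    accuracy: float    # fraction of test cases answered correctly, 0..1
    runtime_s: float   # seconds taken on the reference machine
    robustness: float  # fraction of noisy test cases still correct, 0..1


# Hypothetical weights: they express the challenge owner's priorities.
WEIGHTS = {"accuracy": 0.6, "speed": 0.2, "robustness": 0.2}
TIME_LIMIT_S = 10.0  # assumed runtime budget; slower runs earn no speed credit


def score(sub: Submission) -> float:
    """Return a composite score between 0 and 100."""
    speed = max(0.0, 1.0 - sub.runtime_s / TIME_LIMIT_S)
    total = (
        WEIGHTS["accuracy"] * sub.accuracy
        + WEIGHTS["speed"] * speed
        + WEIGHTS["robustness"] * sub.robustness
    )
    return round(100.0 * total, 2)


def leaderboard(submissions):
    """Return (solver, score) pairs, best first: the 'score to beat' is row 0."""
    rows = [(s.solver, score(s)) for s in submissions]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up submissions, for illustration only.
submissions = [
    Submission("team_a", accuracy=0.90, runtime_s=2.0, robustness=0.80),
    Submission("team_b", accuracy=0.95, runtime_s=9.0, robustness=0.50),
]
board = leaderboard(submissions)
```

Here the slightly less accurate but faster and more robust entry ranks first, which is the kind of trade-off the weights are meant to make explicit. Changing the weights changes the ranking, so they should be settled and tested before the challenge opens.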
Submissions can be kept private or made public; public submissions allow other competitors to view and improve upon a given solution. Different challenge platforms use different approaches, and there are pros and cons to each with regard to participation incentives, effectiveness and intellectual property protection. Most commercial platforms that specialize in these kinds of challenges have a process for selecting which method will be most effective for a given challenge. Many of these challenges are run as sprints or “marathon matches” as a way to get focused competition applied to the challenge. This may be a cultural attribute of coding communities, but it results in contests that are relatively short, generally one to three weeks.
These challenges also require a heavier emphasis on scientific validation of the solutions. This often requires that each of the top-scoring algorithms be run against a verification data set that was withheld from the contestants. This may change the order of the winners, because the solutions are assessed on different measurements, such as performance, accuracy and robustness.
The up-front work for these challenges, which includes preparing the data and developing and testing scoring functions, can take one to three months. This is quite a bit longer than for some other challenge types. However, when combined with the short sprint challenge period, the entire project still ends up with a timeline similar to that of more traditional innovation challenges, which is about three to six months.
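The validation step described above, in which top-scoring algorithms are rerun against withheld verification data, can be sketched as follows. The toy task (deciding whether a number is even), the two solutions and all function names are invented for illustration; the point is only that a solution tuned to the public data can drop in rank once it meets data it has not seen.

```python
def accuracy(predict, cases):
    """Fraction of (input, expected) pairs that a solution gets right."""
    hits = sum(1 for x, expected in cases if predict(x) == expected)
    return hits / len(cases)


def rerank(solutions, public_cases, withheld_cases, top_n=3):
    """Rank by public accuracy, then re-rank the top_n on withheld data."""
    by_public = sorted(
        solutions,
        key=lambda name: accuracy(solutions[name], public_cases),
        reverse=True,
    )
    finalists = by_public[:top_n]
    by_withheld = sorted(
        finalists,
        key=lambda name: accuracy(solutions[name], withheld_cases),
        reverse=True,
    )
    return by_public, by_withheld


# Toy data: is the number even?
public_cases = [(2, True), (3, False), (4, True), (7, False)]
withheld_cases = [(10, True), (11, False), (12, True), (15, False)]


def memorizer(x):
    # Memorizes the public answers and guesses False for anything else.
    return {2: True, 3: False, 4: True, 7: False}.get(x, False)


def rule(x):
    # A general rule with one deliberate flaw on small inputs.
    return x >= 3 and x % 2 == 0


solutions = {"memorizer": memorizer, "rule": rule}
public_order, final_order = rerank(solutions, public_cases, withheld_cases)
```

In this example the memorizing solution leads on the public cases but falls behind the general rule on the withheld cases, so the final order differs from the public leaderboard.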
While much of this description has focused on specialized community challenges, it is also possible to use either standard problem-solving communities or a more general audience for these types of challenges. This may be particularly useful for visualization challenges or challenges that seek innovation in the approach to the algorithm, visualization or analytics. Because visualization challenges depend on information presentation skills as much as algorithm development, both communities can often participate. In this case, it is important to provide the entire community with the requirements and necessary tools, which may include rendering or analytics APIs, so the solutions can be evaluated fairly and efficiently.
Be sure your problem is well understood and articulated before developing the challenge. For challenges of this type, you also need to define your evaluation method as part of the problem identification step. This ensures that you can identify good solutions and avoid having contestants game the system. Your evaluation criteria should be vetted by technical experts, as well as people with prior challenge experience. This is especially true for visualization challenges. If possible, select a problem whose solution would have substantial impact. A well-defined problem inspires participation from a diversity of creative people.
For data-driven algorithm or big-data challenges, the data must be in good shape. Check for significant gaps, and document them clearly. Unless data cleanup is itself a part of the challenge, make sure that the data set has as few errors as possible. Make sure the format is well defined, is consistent and includes necessary data and metadata. Metadata should be elaborated and robust enough to guide contestants who do not regularly work in the field of the problem. If using multiple data sets, make sure they map consistently to each other.
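A few of these checks are easy to automate before the data is released. The sketch below assumes the data has already been loaded as lists of dictionaries (for example with `csv.DictReader`). The field names and the station and reading records are made up for illustration. It flags missing values and duplicate keys, and lists keys that do not map between two data sets.

```python
def check_records(rows, required, key):
    """Return a list of readable problems: missing fields and duplicate keys."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        missing = [f for f in required if row.get(f) in (None, "")]
        if missing:
            problems.append(f"row {i}: missing {', '.join(missing)}")
        value = row.get(key)
        if value in seen:
            problems.append(f"row {i}: duplicate {key}={value}")
        seen.add(value)
    return problems


def unmatched_keys(left, right, key):
    """Return keys found only in left and keys found only in right."""
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    return sorted(left_keys - right_keys), sorted(right_keys - left_keys)


# Made-up example data sets that should map to each other on station_id.
stations = [
    {"station_id": "A1", "name": "North"},
    {"station_id": "B2", "name": ""},
    {"station_id": "A1", "name": "North annex"},
]
readings = [
    {"station_id": "A1", "pm25": "12.1"},
    {"station_id": "C3", "pm25": "8.4"},
]

station_problems = check_records(stations, ["station_id", "name"], "station_id")
only_in_stations, only_in_readings = unmatched_keys(stations, readings, "station_id")
```

Gaps that cannot be fixed can be copied from a report like this into the documentation that accompanies the data set.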
Make sure there is enough data for all stages of analysis or algorithm development, and that result sets are large enough to be statistically meaningful. You’ll need enough data to provide a usable portion to the competitors, while also holding back enough for testing. Consider the different methods, such as machine learning, that solvers are likely to use and how much data is necessary to make them possible. Larger data sets also make it easier to weed out solutions that are tuned to the sample and cannot be broadly applied.
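One simple way to divide the data between competitors and testing is a seeded random split, sketched below. The 30 percent holdout fraction and the seed are placeholder assumptions; the right proportion depends on how much data the expected methods need and how much is required for a meaningful test.

```python
import random


def split_dataset(records, holdout_fraction=0.3, seed=2024):
    """Shuffle reproducibly and return (released, withheld) lists."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(records)   # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_withheld = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_withheld:], shuffled[:n_withheld]


# Illustrative records: any list of rows would work.
records = list(range(100))
released, withheld = split_dataset(records)
```

Keeping the seed on file lets the challenge team regenerate the same split later, for example when validating the top-scoring solutions. A plain random split assumes the records are independent; time-series or grouped data usually needs a split that respects that structure.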
Make sure sensitive data is removed, obscured or protected by a data use agreement. This could include personally identifiable information (PII), as well as data that is intellectual property (IP) sensitive or has restrictions on its release. Care and creativity must be used to keep the data set usable while meeting legal and ethical guidelines. Consider whether you want challenge participants or winners to continue to have access to data beyond the conclusion of the challenge.
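As one possible way to remove or obscure sensitive fields while keeping records linkable, the sketch below drops some fields outright and replaces identifiers with keyed hashes. The field names, the key and the sample record are hypothetical. Pseudonymizing identifiers does not by itself make a data set safe to release, since combinations of other fields can still identify people, so any real release should be reviewed by privacy and legal staff.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed hash, so it cannot be looked up
    without the key but still matches across records."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def scrub(record, drop, pseudonymize_fields, secret_key):
    """Return a copy of record without dropped fields and with IDs obscured."""
    clean = {}
    for field, value in record.items():
        if field in drop:
            continue  # remove the field entirely
        if field in pseudonymize_fields:
            clean[field] = pseudonymize(str(value), secret_key)
        else:
            clean[field] = value
    return clean


# Hypothetical key and record; a real key would be stored securely, not in code.
SECRET_KEY = b"replace-with-a-real-secret"
record = {"patient_id": "12345", "name": "Jane Doe", "zip": "20500", "result": 4.2}
released_record = scrub(
    record, drop={"name"}, pseudonymize_fields={"patient_id"}, secret_key=SECRET_KEY
)
```

Because the same identifier always maps to the same pseudonym under a given key, contestants can still group records by individual without seeing the original identifier.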
Analytics, visualization and algorithm challenges often have special platform requirements like data storage, indexing or test-environment hosting. Depending on the type and complexity of your problem, consider using a community that specializes in these types of challenges.
The statistical and mathematical modeling involved in these types of challenges lends itself well to the application of techniques from other fields. Using a challenge platform that can enhance outreach to communities in different subject matter areas but with similar skill sets can diversify the types of approaches brought to the challenge.
As with software challenges, it is important to consider how the final product will be used. This will dictate the requested format of submissions. For algorithms, this could affect the programming languages used, the format of the data or the hosting platform for the final product. Often, it is important to engage infrastructure groups such as IT or software development groups in the challenge definition and formulation phase so appropriate technical requirements can be considered and incorporated. This also ensures that the challenge owner has considered all of the costs involved: not just algorithm development, but also integration, deployment and administration of any final deployed solution.
Some challenges have a final meeting or seminar to recognize winners and award prizes. This type of event can promote sharing of successful results with a community so many researchers and solvers benefit from the knowledge gained from the challenge. This type of meeting can also help partners in the U.S. government or private sector apply the solution.
As the challenge is being developed, follow-up opportunities should be considered.
There are many problems that require numerous attempts at solutions or a series of foundational breakthroughs. For example, different specific solutions could be combined into a broader and more powerful composite. For these problems, the challenge may be the first step in identifying where funding and effort should be directed. In these cases, it may be advisable to iteratively plan the challenge by defining interim performance goals. Alternatively, the first challenge may provide the background for a follow-up challenge. Considering these issues during the planning process helps ensure the results of the challenge have a lasting impact and adequate time can be devoted to pursuing the most effective solution.