Big Data: Analytics and Applications. What Is Big Data?

You know the famous joke, right? Big Data is like teenage sex:

  • everyone thinks about it;
  • everyone talks about it;
  • everyone assumes their friends are doing it;
  • almost no one actually does it;
  • those who do, do it badly;
  • everyone thinks it will go better next time;
  • no one takes precautions;
  • everyone is ashamed to admit they don't know something;
  • and when someone finally succeeds, there is always a lot of noise about it.

But let's be honest: behind any hype there is always the usual curiosity. What is all the fuss about, and is there anything genuinely important there? In short, yes, there is; the details are below. We have selected the most striking and interesting applications of Big Data technologies. This small market survey, built on clear examples, confronts you with a simple fact: there is no need to "wait another n years for the magic to become reality." The future has already arrived; it is simply not yet visible everywhere, which is why the labor market has not yet felt its full effect. Let's go.

1 How Big Data technologies are applied where they originated

Large IT companies are where data science originated, so their internal know-how is especially interesting. Google, the birthplace of the MapReduce paradigm, has created an internal project whose sole purpose is to train its programmers in machine-learning technologies. And therein lies its competitive advantage: after acquiring new knowledge, employees introduce new methods into the Google projects they work on every day. Imagine how long the list of areas is in which the company could start a revolution. One example: neural networks are already used across its services.
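The MapReduce paradigm mentioned above is easy to illustrate with a toy word-count example. This is a single-process sketch of the programming model only, not Google's distributed implementation:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle + Reduce: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big ideas", "data beats opinion"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'ideas': 1, 'beats': 1, 'opinion': 1}
```

In a real cluster the map and reduce phases run on many machines in parallel, with the framework handling the shuffle between them; the logic per worker stays this simple.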

Apple is another corporation that embeds machine learning in all of its products. Its advantage is a large ecosystem covering all the digital devices used in everyday life. This lets Apple reach a level others cannot: the company has more user data than almost anyone else. At the same time, its privacy policy is very strict: the corporation has always boasted that it does not use customer data for advertising purposes. Accordingly, user information is encrypted so that even Apple's own lawyers, or the FBI with a warrant, cannot read it.

2 Big Data on 4 wheels

A modern car is an information store: it accumulates data about the driver, the environment, connected devices, and itself. Soon a single connected vehicle will generate up to 25 GB of data per hour.

Vehicle telematics has been used by automakers for many years, but there is now a push for a more sophisticated data-collection approach that takes full advantage of Big Data. It means, for example, that the car can now respond to bad road conditions by automatically activating the anti-lock braking and traction-control systems and notifying the driver.

Other companies, including BMW, combine Big Data technology with information collected from test prototypes, in-vehicle error-memory systems, and customer complaints to identify model weaknesses early in production. Instead of manually evaluating the data, which used to take months, a modern algorithm is now applied. Errors and troubleshooting costs are reduced, which speeds up information-analysis workflows at BMW.

According to expert estimates, by 2019 the market turnover of connected cars will reach $130 billion. This is not surprising, given the pace at which automakers are integrating technologies that have become an integral part of the vehicle.

Using Big Data helps make cars safer and more functional. Toyota, for example, does this by integrating Data Communication Modules (DCM). Its Big Data tooling processes and analyzes the data collected by the DCM to extract further value from it.

3 Application of Big Data in medicine


The implementation of Big Data technologies in the medical field allows doctors to study the disease more thoroughly and choose an effective course of treatment for a particular case. Thanks to the analysis of information, it becomes easier for health workers to predict relapses and take preventive measures. The result is a more accurate diagnosis and improved treatment methods.

The new approach allowed doctors to look at patients' problems from a different angle, which led to the discovery of previously unknown root causes. For example, some ethnic groups are genetically more predisposed to heart disease than others. Now, when a patient complains of a certain illness, doctors take into account data on members of the same group who reported the same problem. The collection and analysis of data reveals far more about patients: from food preferences and lifestyle to the genetic structure of DNA and the metabolites of cells, tissues, and organs. The Center for Children's Genomic Medicine in Kansas City, for instance, analyzes its patients' genetic code for the mutations that cause cancer. An individual approach to each patient that takes his or her DNA into account raises the effectiveness of treatment to a qualitatively different level.

Understanding how Big Data is used is the first, very important change in the medical field. When a patient undergoes treatment, a hospital or other healthcare facility can collect a great deal of meaningful information about that person. The collected data is used to predict disease recurrence with a certain degree of accuracy. For example, if a patient has suffered a stroke, doctors study the time of the cerebrovascular event and analyze the intervals between previous episodes (if any), paying special attention to stress and heavy physical exertion in the patient's life. Based on this data, the hospital gives the patient a clear action plan to reduce the chance of another stroke in the future.
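The recurrence-prediction idea above can be sketched as a simple additive risk score. Everything here is illustrative: the feature names and weights are invented for the example and are in no way clinically validated.

```python
def stroke_recurrence_risk(patient):
    """Toy additive risk score over hypothetical patient features.
    Weights are made up for illustration, not medical guidance."""
    score = 0.0
    score += 2.0 * patient.get("prior_strokes", 0)
    if patient.get("months_since_last_event", 999) < 12:
        score += 1.5          # a recent episode raises the risk
    if patient.get("high_stress", False):
        score += 1.0          # stressful situations in the patient's life
    if patient.get("heavy_physical_activity", False):
        score += 0.5
    return "high" if score >= 3.0 else "low"

print(stroke_recurrence_risk(
    {"prior_strokes": 1, "months_since_last_event": 6, "high_stress": True}))
# high
```

Real systems replace the hand-picked weights with coefficients learned from historical patient records, but the structure of the decision is the same.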

Wearable devices also play a role, helping to identify health problems even if a person does not have obvious symptoms of a particular disease. Instead of assessing the patient’s condition through a long course of examinations, the doctor can draw conclusions based on the information collected by a fitness tracker or smart watch.

One recent example comes from hospital practice. While a man was being examined for a new seizure caused by a missed medication, doctors discovered he had a much more serious health problem: atrial fibrillation. The diagnosis was made because the department staff got access to the patient's phone, specifically to the application linked to his fitness tracker. Data from the application proved to be the key factor in the diagnosis, since no cardiac abnormalities were detected during the examination itself.

This is just one of many cases showing why the use of big data plays such a significant role in medicine today.

4 Data analysis has already become the core of retail

Understanding user queries and targeting is one of the largest and most publicized areas where Big Data tools are applied. Big Data helps analyze customer habits in order to better understand consumer needs in the future. Companies are looking to extend the traditional data set with information from social networks and browser search history to build the most complete customer picture possible. Sometimes large organizations set the creation of their own predictive model as a global goal.

For example, the Target retail chain, using in-depth data analysis and its own forecasting system, manages to determine with high accuracy whether a customer is expecting a child. Each customer is assigned an ID linked to a credit card, name, or email. The identifier serves as a kind of shopping history, storing information about everything the person has ever purchased. The chain's analysts found that pregnant women actively buy unscented products before the second trimester, and in the first 20 weeks stock up on calcium, zinc, and magnesium supplements. Based on this data, Target sends customers coupons for baby products. The discounts on children's goods are "diluted" with coupons for other products so that offers to buy a crib or diapers do not look too intrusive.
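The scoring idea behind such a forecasting system can be sketched in a few lines. The product names and weights below are invented for illustration; a real model would learn them from millions of purchase histories.

```python
def pregnancy_score(purchases, weighted_products=None):
    """Toy version of the purchase-pattern scoring idea: each
    predictive product contributes a weight. Weights are invented."""
    if weighted_products is None:
        weighted_products = {
            "unscented_lotion": 0.4,
            "calcium_supplement": 0.3,
            "zinc_supplement": 0.2,
            "magnesium_supplement": 0.2,
        }
    # Deduplicate the basket, then sum the weights of predictive items.
    return sum(weighted_products.get(item, 0.0) for item in set(purchases))

basket = ["unscented_lotion", "calcium_supplement", "magnesium_supplement"]
print(round(pregnancy_score(basket), 2))  # 0.9
```

A threshold on this score (say, above 0.8) would then trigger the coupon mailing described above.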

Even election campaigns have found a way to use Big Data technologies. Some believe that Barack Obama's victory in the 2012 US presidential election was due to the excellent work of his team of analysts, who processed huge amounts of data in the right way.

5 Big Data protects law and order


Over the past few years, law enforcement agencies have been able to figure out how and when to use Big Data. It is a well-known fact that the National Security Agency uses Big Data technologies to prevent terrorist attacks. Other departments are using advanced methodology to prevent smaller crimes.

The Los Angeles Police Department uses a predictive-analytics system that does what is commonly called proactive policing. Using crime reports accumulated over a period of time, the algorithm identifies the areas where crimes are most likely to occur. The system marks such areas on the city map with small red squares, and the data is immediately transmitted to patrol cars.
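The core of such a hotspot system is surprisingly simple: bucket incident coordinates into grid cells and rank the cells by report count. This is a minimal sketch with made-up coordinates, not the actual deployed algorithm, which also weights recency and nearby incidents.

```python
from collections import Counter

def hotspots(incidents, cell_size=0.5, top_n=3):
    """Bucket (x, y) incident coordinates into square grid cells and
    return the cells with the most reports: the 'red squares' on the map."""
    cells = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in incidents
    )
    return [cell for cell, _ in cells.most_common(top_n)]

reports = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (5.0, 5.0)]
print(hotspots(reports, top_n=1))  # [(0, 0)]
```

Patrol routing then simply prioritizes the returned cells during the next shift.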

Chicago police use Big Data technologies in a slightly different way. Law enforcement in the Windy City also relies on an algorithm, but one aimed at outlining a "risk circle" of people who could become a victim of, or a participant in, an armed attack. According to The New York Times, the algorithm assigns each person a vulnerability rating based on criminal history (arrests, involvement in shootings, membership in criminal groups). The system's developers say that while it examines a person's criminal record, it does not take into account secondary factors such as race, gender, ethnicity, or location.

6 How Big Data technologies help cities develop


Veniam CEO Joao Barros demonstrates a map tracking the Wi-Fi routers on buses in Porto

Data analysis is also used to improve many aspects of life in cities and countries. For example, knowing exactly how and when to apply Big Data technologies, you can optimize traffic flows. To do this, vehicle movements are tracked online, and social media and meteorological data are analyzed. Today a number of cities have committed to using data analytics to merge transport infrastructure and other public services into a single whole. This is the concept of the "smart" city, in which buses wait for late trains and traffic lights can predict congestion to minimize traffic jams.

The city of Long Beach uses Big Data-driven smart water meters to stop illegal watering. Previously, they helped private households cut water consumption (the best result was an 80% reduction). Saving fresh water is always a pressing issue, especially when the state is going through its worst recorded drought.

Representatives of the Los Angeles Department of Transportation have also joined the ranks of Big Data users. Based on data from traffic-camera sensors, the authorities monitor the operation of traffic lights, which in turn allows them to regulate traffic. The computerized system controls about 4,500 traffic lights across the city; according to official figures, the new algorithm has helped reduce congestion by 16%.

7 The engine of progress in marketing and sales


In marketing, Big Data tools make it possible to identify which ideas are most effective for promotion at each stage of the sales cycle. Data analysis determines how investments can improve customer relationship management, what strategy should be adopted to improve conversion rates, and how to optimize the customer lifecycle. In cloud businesses, Big Data algorithms are used to figure out how to minimize customer-acquisition cost and extend the customer lifecycle.

Differentiating pricing strategies according to a customer's tier within the system is perhaps the main thing Big Data is used for in marketing. McKinsey found that about 75% of an average firm's revenue comes from its core products, 30% of which are mispriced. A 1% price increase yields an 8.7% increase in operating profit.
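The "1% price rise, 8.7% profit rise" relationship is just margin arithmetic: at constant volume and costs, the extra revenue flows straight to operating profit, so the uplift equals the price increase divided by the operating margin. The 11.5% margin below is an assumed figure chosen to reproduce the article's numbers, not one stated by McKinsey.

```python
def profit_uplift(operating_margin, price_increase=0.01):
    """At constant volume and costs, a price rise flows straight to
    operating profit, so uplift = price_increase / operating_margin."""
    return price_increase / operating_margin

# With an assumed 11.5% operating margin, a 1% price rise
# lifts operating profit by roughly 8.7%.
print(round(profit_uplift(0.115) * 100, 1))  # 8.7
```

The same arithmetic shows why thin-margin businesses are so sensitive to mispricing: the lower the margin, the larger the profit swing per 1% of price.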

The Forrester research team found that data analytics lets marketers focus on making customer relationships more successful. By examining how customers develop, specialists can assess their loyalty and extend their lifecycle in the context of a specific company.

The optimization of sales strategies and of entry into new markets using geo-analytics is well illustrated by the biopharmaceutical industry. According to McKinsey, drug companies spend an average of 20 to 30% of profits on administration and sales. If enterprises use Big Data more actively to identify the most profitable and fastest-growing markets, costs will drop immediately.

Data analytics is a means for companies to gain a complete picture of key aspects of their business. Increasing revenue, reducing costs and reducing working capital are three challenges that modern businesses are trying to solve with the help of analytical tools.

Finally, 58% of marketing directors say that Big Data technologies are most visible in search-engine optimization (SEO), email, and mobile marketing, where data analysis plays the most significant role in shaping marketing programs. Only 4% fewer respondents are confident that Big Data will play a significant role in all marketing strategies for many years to come.

8 Global data analysis

No less curious is the application of data analysis to the climate. It is possible that machine learning will ultimately be the only force capable of maintaining the delicate balance. The topic of human influence on global warming still provokes a lot of controversy, so only reliable predictive models based on the analysis of large amounts of data can give an accurate answer. Ultimately, reducing emissions will help us all: we will spend less on energy.

Now Big Data is not an abstract concept that may find its application in a couple of years. This is a completely working set of technologies that can be useful in almost all areas of human activity: from medicine and public order to marketing and sales. The stage of active integration of Big Data into our daily life has just begun, and who knows what the role of Big Data will be in a few years?

Big Data is a set of methods for working with huge volumes of structured or unstructured information. Big data specialists process and analyze it to obtain visual, human-readable results. Look At Me talked to professionals and found out how things stand with big data processing in Russia, and where and what it is best to study for those who want to work in this field.

Alexey Ryvkin about the main trends in the field of big data, communication with customers and the world of numbers

I studied at the Moscow Institute of Electronic Technology. The main thing I took away from there was fundamental knowledge in physics and mathematics. Alongside my studies, I worked at an R&D center, where I was involved in developing and implementing noise-resistant coding algorithms for secure data transmission. After finishing my bachelor's degree, I entered the master's program in business informatics at the Higher School of Economics. After that I wanted to work at IBS. Luckily, at that time, because of the large number of ongoing projects, there was an additional intake of interns, and after several interviews I started working at IBS, one of the largest Russian companies in this field. In three years I went from intern to enterprise-solutions architect. Today I am developing expertise in Big Data technologies for customer companies in the financial and telecom sectors.

There are two main specializations for people who want to work with big data: analysts, and the IT consultants who create the technologies for working with big data. Beyond that, we can talk about the profession of Big Data Analyst: people who work directly with the data, on the customer's IT platform. Previously, these were ordinary mathematical analysts who knew statistics and mathematics and used statistical software to solve data-analysis problems. Today, in addition to statistics and mathematics, an understanding of technology and of the data lifecycle is also required. This, in my view, is what distinguishes modern Data Analysts from the analysts who came before.

My specialization is IT consulting, that is, I come up with and offer clients ways to solve business problems using IT technologies. People with different experiences come to consulting, but the most important qualities for this profession are the ability to understand the needs of the client, the desire to help people and organizations, good communication and team skills (since it is always working with the client and in a team), good analytical skills. Internal motivation is very important: we work in a competitive environment, and the customer expects unusual solutions and interest in work.

Most of my time is spent communicating with customers, formalizing their business needs, and helping them develop the most suitable technology architecture. The selection criteria here have a peculiarity: in addition to functionality and TCO (total cost of ownership), the system's non-functional requirements matter a great deal, most often response time and information-processing time. To convince the customer, we often use a proof-of-concept approach: we offer to "test" the technology for free on some task, on a narrow data set, to make sure it works. The solution should create a competitive advantage for the customer through additional benefits (for example, cross-selling) or solve some business problem, say, reduce a high level of loan fraud.

It would be much easier if clients came with a ready-made task, but so far they do not understand that a revolutionary technology has appeared that can change the market in a couple of years

What problems do you face? The market is not yet ready to use big data technologies. It would be much easier if clients came with a ready-made task, but so far they do not understand that a revolutionary technology has appeared that can change the market in a couple of years. This is why we essentially work in startup mode: we don't just sell technologies; every time we have to convince clients that they need to invest in these solutions. This is the position of visionaries: we show customers how they can change their business using data and IT. We are creating a new market, the market for commercial IT consulting in the field of Big Data.

If a person wants to go into data analysis or IT consulting in the field of Big Data, then the first requirement is a mathematical or technical education with solid mathematical training. It is also useful to master specific technologies such as SAS, Hadoop, the R language, or IBM solutions. In addition, you need to take an active interest in applications of Big Data: for example, how it can be used for improved credit scoring in a bank, or for customer-lifecycle management. This and other knowledge can be obtained from available sources, such as Coursera and Big Data University. There is also the Customer Analytics Initiative at Wharton, University of Pennsylvania, which has published a lot of interesting material.

A major problem for those who want to work in our field is the clear lack of information about Big Data. You cannot go to a bookstore or some website and get, for example, a comprehensive collection of cases on all applications of Big Data technologies in banks. There are no such directories. Some of the information is in books, some is collected at conferences, and some you have to figure out on your own.

Another problem is that analysts are comfortable in the world of numbers, but they are not always comfortable in business. These people are often introverted and have difficulty communicating, making it difficult for them to communicate research findings convincingly to clients. To develop these skills, I would recommend books such as The Pyramid Principle, Speak the Language of Diagrams. They help develop presentation skills and express your thoughts concisely and clearly.

Participating in various case championships while studying at the National Research University Higher School of Economics helped me a lot. Case championships are intellectual competitions for students where they need to study business problems and propose solutions to them. There are two types: case championships of consulting firms, for example, McKinsey, BCG, Accenture, as well as independent case championships such as Changellenge. While participating in them, I learned to see and solve complex problems - from identifying a problem and structuring it to defending recommendations for its solution.

Oleg Mikhalsky about the Russian market and the specifics of creating a new product in the field of big data

Before joining Acronis, I was already involved in launching new products to market at other companies. It is always interesting and difficult at the same time, so I was immediately interested in the opportunity to work on cloud services and storage solutions. All my previous experience in the IT industry, including my own startup project I-accelerator, came in handy in this area. Having a business education (MBA) in addition to a basic engineering degree also helped.

In Russia, large companies such as banks and mobile operators need big data analysis, so there are prospects in our country for those who want to work in this area. True, many projects today are integration projects, that is, built on foreign developments or open-source technologies. In such projects, fundamentally new approaches and technologies are not created; existing ones are adapted. At Acronis, we took a different path: after analyzing the available alternatives, we decided to invest in our own development, and the result is a reliable big-data storage system that is comparable in cost to, say, Amazon S3, but works reliably and efficiently at a significantly smaller scale. Large Internet companies also have their own big data developments, but these are focused more on internal needs than on the needs of external customers.

It is important to understand the trends and economic forces that influence the field of big data. To do this, you need to read a lot, listen to talks by authoritative experts in the IT industry, and attend thematic conferences. Now almost every conference has a section on Big Data, but they all approach it from different angles: technology, business, or marketing. You can also take project work or an internship at a company that is already running projects on this topic. If you are confident in your abilities, it is not too late to launch a startup in the field of Big Data.

Without constant contact with the market, a new product risks going unclaimed

True, when you are responsible for a new product, a lot of time is spent on market analytics and on communication with potential clients, partners, and industry analysts who know a lot about clients and their needs. Without constant contact with the market, a new product risks going unclaimed. There are always many uncertainties: you have to figure out who the early adopters will be, what you can offer them, and how then to attract a mass audience. The second most important task is to formulate and convey to the developers a clear, coherent vision of the final product, and to motivate them to work in an environment where some requirements may still change and priorities depend on feedback from early customers. So an important task is managing expectations, both the clients' and the developers', so that neither side loses interest and the project is carried through to completion. After the first successful project it gets easier, and the main task becomes finding the right growth model for the new business.

In the Russian-speaking world, both the English term Big Data and the phrase "big data" are used; the latter is a calque of the English term. Big data has no strict definition. It is impossible to draw a clear line: is it 10 terabytes or 10 megabytes? The name itself is highly subjective. The word "big" here is like the "one, two, many" of primitive tribes.

However, there is an established view that big data is a set of technologies designed to perform three operations. First, to process volumes of data larger than in "standard" scenarios. Second, to handle rapidly arriving data in very large volumes: there is not just a lot of data, it keeps growing. Third, to work with structured and poorly structured data in parallel and in different respects. Big data assumes that algorithms receive a stream of information that is not always structured and that more than one idea can be extracted from it.

A typical example of big data is the information coming from large physics experiment facilities. Such an installation continuously produces huge volumes of data, and scientists use them to solve many problems in parallel.

Big data emerged in the public sphere because the data began to affect almost everyone, not just the scientific community, where such problems have long been solved. Big Data technology entered public discourse when people started talking about a very specific number: the planet's population. Seven billion people gathering in social networks and other projects that aggregate people. YouTube, Facebook, VKontakte, where the number of users is measured in billions and the number of operations they perform simultaneously is enormous. The data flow here is user actions: for example, the data from the same YouTube hosting, flowing through the network in both directions. Processing means not only interpretation but also the ability to handle each of these actions correctly, that is, to put it in the right place and make the data quickly available to every user, since social networks do not tolerate waiting.

Much of what concerns big data, and the approaches used to analyze it, has actually existed for quite some time. For example, processing images from surveillance cameras, where we are dealing not with one picture but with a data stream. Or robot navigation. All this has existed for decades, but now data-processing tasks affect a far larger number of people and ideas.

Many developers are accustomed to working with static objects and thinking in terms of states. In big data the paradigm is different. You have to be able to work with a constant flow of data, and this is an interesting task. It affects more and more areas.
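The shift from static state to a flow of data can be shown with a streaming statistic. The sketch below computes a running mean one value at a time with O(1) state, so it works on an unbounded stream where a batch computation never could:

```python
import itertools

def running_mean(stream):
    """Process values one at a time, keeping constant-size state:
    the streaming mindset, as opposed to batch processing."""
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count   # incremental (Welford-style) update
        yield mean

# Works even on a potentially endless generator; we just take
# the first few results as they arrive.
print(list(itertools.islice(running_mean(iter([2, 4, 6, 8])), 4)))
# [2.0, 3.0, 4.0, 5.0]
```

The same pattern (consume, update small state, emit) underlies real stream-processing systems; only the statistic and the scale change.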

In our lives, more and more hardware and software are beginning to generate large amounts of data - for example, the Internet of Things.

Things already generate huge flows of information. The Potok police system sends information from all its cameras and lets you find cars using this data. Fitness bracelets, GPS trackers and other devices serving the needs of individuals and businesses are becoming increasingly popular.

The Moscow Department of Informatization is recruiting a large number of data analysts, because a great deal of statistics about people is accumulated, and it is multi-criteria (that is, statistics on a very large number of criteria are collected about each person and each group of people). Patterns and trends must be found in this data. Such tasks call for mathematicians with an IT education, because ultimately the data is stored in structured DBMSs, and you need to be able to query them and extract information.

Previously, we did not regard big data as a problem for the simple reason that there was nowhere to store it and no networks to transmit it. When these capabilities appeared, the data immediately filled all the volume provided to it. But no matter how much bandwidth and storage capacity are expanded, there will always be sources, such as physics experiments or simulations of airflow over a wing, that produce more information than we can transmit. By Moore's law, the performance of modern parallel computing systems is steadily increasing, and network transmission speeds are growing too. However, data must also be saved to and retrieved from storage media (hard drives and other types of memory) quickly, and this is another challenge of big data processing.

Big data - what is it in simple words

In 2010, the first attempts to address the growing big data problem appeared. Software products were released aimed at minimizing the risks of working with huge amounts of information.

By 2011, large companies such as Microsoft, Oracle, EMC and IBM had become interested in big data; they were the first to build Big Data work into their development strategies, and quite successfully.

Universities began teaching big data as a separate subject in 2013; now not only data science but also engineering, together with computing disciplines, deals with problems in this area.

The main methods of data analysis and processing include the following:

  1. Methods of the Data Mining class (deep analysis).

These methods are quite numerous, but they share one thing: the mathematical tools they use are combined with achievements from information technology.

  2. Crowdsourcing.

This technique makes it possible to collect data from many sources simultaneously, and the number of sources is practically unlimited.

  3. A/B testing.

From the entire data set, a control set of elements is selected and compared in turn with other similar sets in which one element has been changed. Such tests help determine which parameter's fluctuations have the greatest effect on the control population. Thanks to the sheer volume of Big Data, a huge number of iterations can be run, each bringing the result closer to the most reliable one.
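A single iteration of such a comparison is usually a two-proportion z-test: did the changed variant convert differently from the control? A minimal sketch with made-up conversion numbers:

```python
from math import sqrt, erf

def ab_z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test for an A/B experiment.
    Returns (z statistic, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)           # pooled conversion rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_b - p_a) / se
    # Normal CDF via the error function; two-sided tail probability.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Illustrative numbers: 2.0% vs 2.6% conversion on 10,000 users each.
z, p = ab_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(round(z, 2), round(p, 4))
```

With |z| above 1.96 (p below 0.05), the difference between the variants is conventionally treated as statistically significant.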

  4. Predictive analytics.

Specialists in this field try to predict and plan in advance how the controlled object will behave in order to make the most profitable decision in this situation.

  5. Machine learning (artificial intelligence).

It is based on empirical analysis of information and the subsequent construction of self-learning algorithms for systems.
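A "self-learning algorithm" in its most stripped-down form is the classic perceptron: the system adjusts its weights on every mistake until it classifies the examples correctly. This toy sketch learns the logical AND function; it is illustrative of the principle only.

```python
def train_perceptron(samples, epochs=10, lr=0.1):
    """Minimal self-learning algorithm: weights are nudged on every
    misclassified example. samples = [(features, label)], label in {0, 1}."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred                       # 0 when correct
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Learn logical AND from its four input/output examples.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)

def predict(x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

print([predict(x) for x, _ in data])  # [0, 0, 0, 1]
```

Modern systems stack millions of such units and learn by gradient descent rather than this simple update rule, but the empirical learn-from-errors loop is the same.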

  6. Network analysis.

This method is most often used to study social networks: after statistical data is obtained, the nodes of the network are analyzed, that is, the interactions between individual users and their communities.
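The most basic network-analysis statistic is degree centrality: how many connections each node (user) has. A minimal sketch on a made-up friendship graph:

```python
from collections import defaultdict

def degree_centrality(edges):
    """Count connections per node in an interaction graph,
    the simplest measure used in network analysis."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)

friendships = [("ann", "bob"), ("ann", "eve"), ("bob", "eve"), ("ann", "dan")]
print(degree_centrality(friendships))
# {'ann': 3, 'bob': 2, 'eve': 2, 'dan': 1}
```

High-degree nodes are the obvious first candidates for community hubs; richer measures (betweenness, PageRank-style scores) build on the same edge-list representation.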

Prospects and trends for the development of Big data

In 2017, when big data ceased to be something new and unknown, its importance not only did not decrease, but increased even more. Experts are now betting that big data analytics will become available not only to giant organizations, but also to small and medium-sized businesses. This approach is planned to be implemented using the following components:

  • Cloud storage.

Data storage and processing are becoming faster and more economical - compared to the costs of maintaining your own data center and possible expansion of staff, renting a cloud seems to be a much cheaper alternative.

  • Using Dark Data.

So-called "dark data" is all of a company's non-digitized information, which plays no key role in day-to-day use but can motivate a switch to a new format of information storage.

  • Artificial Intelligence and Deep Learning.

Deep learning technology, which imitates the structure and operation of the human brain, is ideally suited to processing large volumes of constantly changing information. Here the machine does everything a person would do, but with a significantly reduced likelihood of error.

  • Blockchain

This technology makes it possible to speed up and simplify numerous online transactions, including international ones. Another advantage of Blockchain is that it reduces transaction costs.

  • Self-service and reduced prices.

In 2017, it is planned to introduce "self-service platforms": free platforms where representatives of small and medium-sized businesses can independently evaluate and systematize the data they store.

VISA used Big Data in a similar way, tracking fraudulent attempts to carry out particular transactions. Thanks to this, the company saves more than $2 billion annually that would otherwise be lost to fraud.

The German Labor Ministry cut costs by 10 billion euros by introducing a big data system into its work on issuing unemployment benefits. In the process, it emerged that a fifth of citizens were receiving these benefits without justification.

Big Data has not spared the gaming industry either. The World of Tanks developers studied information about all players and compared their activity indicators. This helped predict possible future player churn; based on these findings, representatives of the organization were able to interact with users more effectively.

Notable organizations using big data also include HSBC, Nasdaq, Coca-Cola, Starbucks and AT&T.

Big Data problems

The biggest problem with big data is the cost of processing it. This includes both expensive equipment and the wage costs of qualified specialists capable of servicing huge volumes of information. Obviously, the equipment will have to be upgraded regularly so that it retains at least minimal functionality as data volumes grow.

The second problem is again related to the sheer amount of information to be processed. If, for example, a study produces not 2-3 but a great many results, it is very difficult to remain objective and to select from the general flow only the data that will have a real impact on the state of a phenomenon.

The Big Data privacy problem. With most customer-facing services moving to online data usage, it is very easy to become the next target for cybercriminals. Even simply storing personal information, without conducting any online transactions, can be fraught with undesirable consequences for cloud-storage customers.

The problem of information loss. Precaution demands not limiting yourself to a single one-off backup but keeping at least 2-3 backup copies in separate storage facilities. However, as volumes grow, so does the difficulty of maintaining redundancy, and IT specialists are still searching for an optimal solution to this problem.

Big data technology market in Russia and the world

As of 2014, 40% of the big data market volume is made up of services. Revenue from the use of Big Data in computer equipment is slightly inferior (38%) to this indicator. The remaining 22% comes from software.

According to statistics, the most useful products in the global segment for solving Big Data problems are In-memory and NoSQL analytical platforms. Log-file analytics software and Columnar platforms occupy 15 and 12 percent of the market, respectively. Hadoop/MapReduce, by contrast, handles big data problems in practice not very effectively.

Results of implementing big data technologies:

  • improved quality of customer service;
  • optimized supply chain integration;
  • optimized organizational planning;
  • faster interaction with clients;
  • more efficient processing of customer requests;
  • reduced service costs.

Best books on Big Data

"The Human Face of Big Data" by Rick Smolan and Jennifer Erwitt

Suitable for an initial study of big data processing technologies: it introduces the subject easily and clearly. It makes clear how the abundance of information has influenced everyday life and all its spheres: science, business, medicine, and so on. It contains numerous illustrations, so it can be absorbed without much effort.

"Introduction to Data Mining" by Pang-Ning Tan, Michael Steinbach and Vipin Kumar

Also useful for beginners, this book explains working with big data on the principle of "from simple to complex." It covers many points important at the initial stage: preparation for processing, visualization, OLAP, as well as some methods of data analysis and classification.

"Python Machine Learning" by Sebastian Raschka

A practical guide to using and working with big data using the Python programming language. Suitable for both engineering students and professionals who want to deepen their knowledge.

"Hadoop for Dummies", Dirk Derus, Paul S. Zikopoulos, Roman B. Melnik

Hadoop is a project created specifically for working with distributed programs that organize the execution of actions on thousands of nodes simultaneously. Getting to know it will help you understand in more detail the practical application of big data.

It was predicted that the total global volume of data created and replicated in 2011 could be about 1.8 zettabytes (1.8 trillion gigabytes) - about 9 times more than what was created in 2006.

More complex definition

However, big data involves more than just analyzing huge amounts of information. The problem is not that organizations create huge amounts of data, but that most of it is in formats that fit poorly with the traditional structured database: web logs, video, text documents, machine code or, for example, geospatial data. All of this is stored in many different repositories, sometimes even outside the organization. As a result, corporations may have access to a huge amount of their data yet lack the tools needed to establish relationships within that data and draw meaningful conclusions from it. Add the fact that data is now updated more and more frequently, and you get a situation in which traditional methods of analysis cannot keep up with huge volumes of constantly refreshed data, which ultimately opens the way for big data technologies.

Best definition

In essence, the concept of big data involves working with information of huge volume and diverse composition, frequently updated and scattered across different sources, with the aim of increasing operational efficiency, creating new products and improving competitiveness. The consulting firm Forrester puts it briefly: "Big data brings together techniques and technologies that extract meaning from data at the extreme limits of practicality."

How big is the difference between business analytics and big data?

Craig Baty, executive director of marketing and chief technology officer of Fujitsu Australia, has pointed out that business analysis is a descriptive process of analyzing the results a business achieved over a certain period, whereas the processing speed of big data makes the analysis predictive, capable of offering business recommendations for the future. Big data technologies also allow more types of data to be analyzed than business intelligence tools can, making it possible to go beyond purely structured repositories.

Matt Slocum of O'Reilly Radar believes that although big data and business analytics have the same goal (finding answers to a question), they differ from each other in three aspects.

  • Big data is designed to handle larger volumes of information than business analytics, and this certainly fits the traditional definition of big data.
  • Big data is designed to handle faster, faster-changing information, which means deep exploration and interactivity. In some cases, results are generated faster than the web page loads.
  • Big data is designed to process unstructured data whose uses we are only beginning to explore once we are able to collect and store it, and we need algorithms and interactive capabilities to make it easier to find the trends contained in these data sets.

According to the Oracle white paper "Oracle Information Architecture: An Architect's Guide to Big Data," when working with big data we approach information differently than when conducting business analysis.

Working with big data is unlike the usual business intelligence process, where simply adding up known values produces a result: for example, summing paid invoices yields sales for the year. With big data, the result is obtained by refining the data through sequential modeling: first a hypothesis is put forward; then a statistical, visual or semantic model is built; the model is used to check the accuracy of the hypothesis; and then the next hypothesis is put forward. This process requires the researcher either to interpret visual meanings, or to construct interactive queries based on knowledge, or to develop adaptive "machine learning" algorithms capable of producing the desired result. Moreover, the lifetime of such an algorithm can be quite short.

Big data analysis techniques

There are many different methods for analyzing data sets, based on tools borrowed from statistics and computer science (for example, machine learning). The list below does not pretend to be complete, but it reflects the most popular approaches across industries. Researchers continue to create new techniques and improve existing ones. In addition, some of the techniques listed do not apply exclusively to big data and can be used successfully on smaller arrays (for example, A/B testing or regression analysis). Naturally, the larger and more diverse the array analyzed, the more accurate and relevant the resulting data.

A/B testing. A technique in which a control sample is compared in turn with others. This makes it possible to identify the optimal combination of indicators to achieve, for example, the best consumer response to a marketing offer. Big Data allows a huge number of iterations to be run, yielding a statistically reliable result.

Association rule learning. A set of techniques for identifying relationships, i.e. association rules between variables in large data sets. Used in data mining.
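The core of association rule learning is counting two quantities over transactions: support (how often an itemset occurs) and confidence (how often the rule's right side follows its left side). A minimal sketch over an invented basket of transactions:

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
]

def rules(transactions, min_support=0.4, min_conf=0.6):
    """Emit rules x -> y whose support and confidence clear the thresholds."""
    n = len(transactions)
    items, pairs = Counter(), Counter()
    for t in transactions:
        for i in t:
            items[i] += 1
        for a, b in combinations(sorted(t), 2):
            pairs[(a, b)] += 1
    out = []
    for (a, b), cnt in pairs.items():
        support = cnt / n                       # fraction of baskets with both
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            conf = cnt / items[x]               # P(y in basket | x in basket)
            if conf >= min_conf:
                out.append((x, y, support, conf))
    return out

for x, y, s, c in rules(transactions):
    print(f"{x} -> {y}  support={s:.2f} confidence={c:.2f}")
```

Production algorithms such as Apriori or FP-Growth do the same counting, but prune the search space so it scales to millions of transactions.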

Classification. A set of techniques that allows you to predict consumer behavior in a certain market segment (purchase decisions, churn, consumption volume, etc.). Used in data mining.

Cluster analysis. A statistical method for classifying objects into groups by identifying common features that are not known in advance. Used in data mining.

Crowdsourcing. A methodology for collecting data from a large number of sources.

Data fusion and data integration. A set of techniques that allows you to analyze comments from social network users and compare them with sales results in real time.

Data mining. A set of techniques that allows you to determine the categories of consumers most susceptible to the promoted product or service, identify the characteristics of the most successful employees, and predict the behavioral model of consumers.

Ensemble learning. This method uses many predictive models, thereby improving the quality of the forecasts made.
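The simplest ensemble is majority voting: several weak models each make a prediction and the most common answer wins. The three churn rules below are deliberately crude and entirely hypothetical:

```python
from collections import Counter

def majority_vote(models, x):
    """Ensemble prediction: each model votes, the majority wins."""
    votes = [m(x) for m in models]
    return Counter(votes).most_common(1)[0][0]

# Three weak rules for classifying a customer as "churn" / "stay"
models = [
    lambda c: "churn" if c["complaints"] > 2 else "stay",
    lambda c: "churn" if c["months_inactive"] > 1 else "stay",
    lambda c: "churn" if c["spend_drop"] > 0.5 else "stay",
]

customer = {"complaints": 3, "months_inactive": 0, "spend_drop": 0.7}
print(majority_vote(models, customer))  # "churn": two of three models agree
```

Methods like random forests and gradient boosting refine this idea, but the principle is the same: many imperfect models combined beat any one of them.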

Genetic algorithms. In this technique, possible solutions are represented as "chromosomes" that can combine and mutate. As in natural evolution, the fittest individual survives.
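The selection/crossover/mutation loop can be sketched in a few lines. Here the "chromosome" is a single real number and the fitness landscape is an invented function with its peak at x = 7:

```python
import random

def evolve(fitness, pop_size=30, generations=60, seed=1):
    """Tiny genetic algorithm over real-valued 'chromosomes'."""
    rnd = random.Random(seed)
    pop = [rnd.uniform(-50, 50) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]     # selection: fittest half lives on
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = rnd.sample(survivors, 2)
            child = (a + b) / 2              # crossover: blend two parents
            child += rnd.gauss(0, 1)         # mutation: small random tweak
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Maximize a simple fitness function whose peak is at x = 7
best = evolve(lambda x: -(x - 7) ** 2)
print(round(best, 1))  # converges near 7
```

Real applications encode schedules, routes or parameter sets as chromosomes, but the evolutionary loop is identical.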

Machine learning. A branch of computer science (historically labeled "artificial intelligence") that pursues the goal of creating self-learning algorithms based on the analysis of empirical data.
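"Self-learning from empirical data" can be shown with the classic perceptron: the model starts with zero weights and corrects itself on every mistake. The labeled examples below (activity and purchases predicting a "loyal customer") are invented for the sketch:

```python
import random

def train_perceptron(data, epochs=400, lr=0.1, seed=0):
    """Learn weights for a linear classifier from labeled examples."""
    rnd = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        rnd.shuffle(data)
        for (x1, x2), label in data:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = label - pred               # 0 when the prediction is correct
            w[0] += lr * err * x1            # nudge weights toward the truth
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# Hypothetical examples: (hours_active, purchases) -> 1 = "loyal customer"
data = [((9, 4), 1), ((8, 5), 1), ((7, 4), 1),
        ((1, 0), 0), ((2, 1), 0), ((0, 1), 0)]
w, b = train_perceptron(list(data))
predict = lambda x1, x2: 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
print(predict(9, 4), predict(1, 0))  # → 1 0
```

Nothing here was hand-tuned: the decision boundary emerges purely from the empirical examples, which is the essence of the definition above.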

Natural language processing (NLP). A set of techniques, borrowed from computer science and linguistics, for recognizing natural human language.

Network analysis. A set of techniques for analyzing connections between nodes in networks. In relation to social networks, it allows you to analyze the relationships between individual users, companies, communities, etc.

Optimization. A set of numerical methods for redesigning complex systems and processes to improve one or more metrics. Helps in making strategic decisions, for example, the composition of the product line to be launched on the market, conducting investment analysis, etc.
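Most numerical optimization methods boil down to iteratively stepping toward a better value of the target metric. Gradient descent is the canonical example; the quadratic cost model below is invented for illustration:

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Minimize a function by repeatedly stepping against its gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Hypothetical cost model: cost(q) = 2q^2 - 12q + 30, minimized at q = 3
grad = lambda q: 4 * q - 12          # derivative of the cost function
best_q = gradient_descent(grad, x0=0.0)
print(round(best_q, 3))  # → 3.0
```

Strategic questions such as product-line composition are higher-dimensional and often constrained, but the underlying machinery is the same iterative improvement of a metric.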

Pattern recognition. A set of techniques with self-learning elements for predicting the behavioral model of consumers.

Predictive modeling. A set of techniques for creating a mathematical model of a predetermined, probable scenario of events. An example is analyzing a CRM system's database for conditions that might prompt subscribers to switch providers.

Regression. A set of statistical methods for identifying a pattern between changes in a dependent variable and one or more independent variables. Often used for forecasting and predictions. Used in data mining.
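The simplest case, ordinary least squares for one independent variable, fits y = a + b·x by minimizing squared error. A self-contained sketch with invented ad-spend-versus-sales data:

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))       # slope
    a = my - b * mx                              # intercept
    return a, b

# Hypothetical data: ad spend (thousands) vs. sales (thousands)
spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]                 # exactly sales = 1 + 2*spend
a, b = linear_fit(spend, sales)
print(f"sales ≈ {a:.1f} + {b:.1f} * spend")
predicted = a + b * 6                    # forecast for a spend of 6
print(predicted)  # → 13.0
```

This "fit, then extrapolate" step is precisely why the glossary notes regression's use for forecasting and prediction.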

Sentiment analysis. Techniques for assessing consumer sentiment, based on natural language recognition technologies. They make it possible to isolate, from the general information flow, messages related to a subject of interest (for example, a consumer product), and then to evaluate the polarity of the judgment (positive or negative), its degree of emotionality, and so on.
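At its crudest, polarity scoring counts positive and negative words against a lexicon. The toy lexicon and reviews below are invented; real systems use far larger dictionaries and trained language models:

```python
# A toy lexicon; real systems use much larger dictionaries and NLP models.
POSITIVE = {"great", "love", "excellent", "fast"}
NEGATIVE = {"broken", "slow", "terrible", "refund"}

def sentiment(message):
    """Score a message: +1 per positive word, -1 per negative word."""
    words = [w.strip(".,!?") for w in message.lower().split()]
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

reviews = [
    "Love this phone, excellent camera and fast delivery",
    "Screen arrived broken, terrible support, want a refund",
]
print([sentiment(r) for r in reviews])  # ['positive', 'negative']
```

Run over millions of social-media messages, even a scorer this simple gives a usable first cut of consumer mood around a product.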

Signal processing. A set of techniques borrowed from radio engineering that aims to recognize a signal against a background of noise and its further analysis.

Spatial analysis. A set of methods, partly borrowed from statistics, for analyzing spatial data: terrain topology, geographic coordinates, object geometry. The source of big data in this case is often geographic information systems (GIS).

  • Revolution Analytics (based on the R language for mathematical statistics).

Of particular interest in this list is Apache Hadoop, open-source software that has been field-tested as a data analyzer over the past five years. As soon as Yahoo opened the Hadoop code to the open-source community, a whole movement to create Hadoop-based products sprang up in the IT industry. Almost all modern big data analysis tools provide Hadoop integration; their developers range from startups to well-known global companies.

Markets for Big Data Management Solutions

Big Data Platforms (BDP) as a means of combating digital hoarding

The ability to analyze big data, colloquially called Big Data, is perceived as a benefit, and an unambiguous one. But is this really so? Where could unbridled data accumulation lead? Most likely to what psychologists, speaking of humans, call pathological hoarding, or syllogomania, figuratively "Plyushkin syndrome." In English, the vicious passion for collecting everything is called hoarding (from "hoard," a stockpile). In classifications of mental illness, hoarding is a mental disorder. In the digital era, digital hoarding joins traditional material hoarding; it can afflict both individuals and entire enterprises and organizations.

World and Russian market

Big Data landscape: the main suppliers

Almost all leading IT companies have shown interest in tools for collecting, processing, managing and analyzing big data, which is quite natural. Firstly, they encounter the phenomenon directly in their own business; secondly, big data opens up excellent opportunities for developing new market niches and attracting new customers.

Many startups have appeared on the market that make business by processing huge amounts of data. Some of them use ready-made cloud infrastructure provided by large players like Amazon.

Theory and practice of Big Data in industries

History of development

2017

TmaxSoft forecast: the next “wave” of Big Data will require modernization of the DBMS

Businesses know that the vast amounts of data they accumulate contain important information about their business and customers. If a company can successfully apply this information, it will have a significant advantage over its competitors and will be able to offer better products and services than theirs. However, many organizations still fail to effectively use big data due to the fact that their legacy IT infrastructure is unable to provide the necessary storage capacity, data exchange processes, utilities and applications required to process and analyze large amounts of unstructured data to extract valuable information from them, TmaxSoft indicated.

Additionally, the increased processing power needed to analyze ever-increasing volumes of data may require significant investment in an organization's legacy IT infrastructure, as well as additional maintenance resources that could be used to develop new applications and services.

On February 5, 2015, the White House released a report discussing how companies use "big data" to charge different prices to different customers, a practice known as "price discrimination" or "personalized pricing." The report describes the benefits of big data for both sellers and buyers, and its authors conclude that many of the issues raised by big data and differential pricing can be addressed through existing anti-discrimination and consumer-protection laws and regulations.

The report notes that at this time, there is only anecdotal evidence of how companies are using big data in the context of personalized marketing and differentiated pricing. This information shows that sellers use pricing methods that can be divided into three categories:

  • studying the demand curve;
  • steering and differentiated pricing based on demographic data; and
  • targeted behavioral marketing (behavioral targeting) and individualized pricing.

Studying the Demand Curve: To determine demand and study consumer behavior, marketers often conduct experiments in this area in which customers are randomly assigned to one of two possible price categories. “Technically, these experiments are a form of differential pricing because they result in different prices for customers, even if they are “non-discriminatory” in the sense that all customers have the same probability of being “sent” to a higher price.”

Steering: the practice of presenting products to consumers based on their membership in a particular demographic group. For example, a computer company's website may offer the same laptop to different types of buyers at different prices, based either on information they provide about themselves (for example, whether the user represents a government agency, an academic or commercial institution, or is a private individual) or on their geographic location (for example, as determined by the computer's IP address).

Targeted behavioral marketing and customized pricing: In these cases, customers' personal information is used to target advertising and to customize pricing for certain products. For example, online advertisers use data collected by advertising networks and through third-party cookies about online user activity to target their advertisements. This approach, on the one hand, allows consumers to receive advertising for goods and services that interest them. On the other hand, it may worry consumers who do not want certain types of their personal data (such as information about visits to websites dealing with medical or financial matters) collected without their consent.

Although targeted behavioral marketing is widespread, there is relatively little evidence of personalized pricing in the online environment. The report speculates that this may be because the methods are still being developed, or because companies are hesitant to use custom pricing (or prefer to keep quiet about it) - perhaps fearing a backlash from consumers.

The report's authors suggest that "for the individual consumer, the use of big data clearly presents both potential rewards and risks." While acknowledging that big data raises transparency and discrimination issues, the report argues that existing anti-discrimination and consumer protection laws are sufficient to address them. However, the report also highlights the need for “ongoing oversight” when companies use sensitive information in ways that are not transparent or in ways that are not covered by existing regulatory frameworks.

This report continues the White House's efforts to examine the use of big data and discriminatory pricing on the Internet and the resulting consequences for American consumers. It was previously reported that the White House Big Data Working Group published its report on this issue in May 2014. The Federal Trade Commission (FTC) also addressed these issues during its September 2014 workshop on big data discrimination.

2014

Gartner dispels myths about Big Data

A fall 2014 research note from Gartner lists a number of common Big Data myths among IT leaders and provides rebuttals to them.

  • Everyone is implementing Big Data processing systems faster than us

Interest in Big Data technologies is at an all-time high: 73% of organizations surveyed by Gartner analysts this year are already investing in them or planning to do so. But most of these initiatives are still at very early stages, and only 13% of respondents have actually implemented such solutions. The hardest part is determining how to extract income from Big Data and deciding where to start. Many organizations get stuck at the pilot stage because they cannot tie the new technology to specific business processes.

  • We have so much data that there is no need to worry about small errors in it

Some IT managers believe that small data flaws do not affect the overall results of analyzing huge volumes. When there is a lot of data, each individual error actually has less of an impact on the result, analysts note, but the errors themselves also become more numerous. In addition, most of the analyzed data is external, of unknown structure or origin, so the likelihood of errors increases. So in the world of Big Data, quality is actually much more important.

  • Big Data technologies will eliminate the need for data integration

Big Data promises the ability to process data in its original format, with automatic schema generation as it is read. It is believed that this will allow information from the same sources to be analyzed through multiple data models, and many believe it will also let end users interpret any data set as they see fit. In reality, most users often want the traditional experience of a ready-made schema, where the data is formatted appropriately and there are agreements about the level of information integrity and how it should relate to the use case.

  • There is no point in using data warehouses for complex analytics

Many information management system administrators believe there is no point in spending time creating a data warehouse, given that complex analytical systems use new data types. In fact, many complex analytics systems use information from a data warehouse. In other cases, new data types need additional preparation for analysis in Big Data processing systems; decisions have to be made about data suitability, aggregation principles and the required level of quality, and such preparation can occur outside the warehouse.

  • Data warehouses will be replaced by data lakes

In reality, vendors mislead customers by positioning data lakes as a replacement for storage or as critical elements of the analytical infrastructure. Underlying data lake technologies lack the maturity and breadth of functionality found in warehouses. Therefore, managers responsible for data management should wait until lakes reach the same level of development, according to Gartner.

Accenture: 92% of those who implemented big data systems are satisfied with the results

Among the main advantages of big data, respondents named:

  • “searching for new sources of income” (56%),
  • “improving customer experience” (51%),
  • “new products and services” (50%) and
  • “an influx of new customers and maintaining the loyalty of old ones” (47%).

When introducing new technologies, many companies are faced with traditional problems. For 51%, the stumbling block was security, for 47% - budget, for 41% - lack of necessary personnel, and for 35% - difficulties in integrating with the existing system. Almost all companies surveyed (about 91%) plan to soon solve the problem of staff shortages and hire big data specialists.

Companies are optimistic about the future of big data technologies. 89% believe they will change business as much as the Internet. 79% of respondents noted that companies that do not engage in big data will lose their competitive advantage.

However, respondents disagreed about what exactly should be considered big data. 65% of respondents believe that these are “large data files”, 60% believe that this is “advanced analytics and analysis”, and 50% believe that this is “data visualization tools”.

Madrid spends €14.7 million on big data management

In July 2014, it became known that Madrid would use big data technologies to manage city infrastructure. The cost of the project is 14.7 million euros, the basis of the implemented solutions will be technologies for analyzing and managing big data. With their help, the city administration will manage work with each service provider and pay accordingly depending on the level of services.

We are talking about contractors to the administration who monitor the condition of streets, lighting, irrigation and green spaces, clean the territory, and handle waste removal and recycling. During the project, 300 key performance indicators of city services were developed for specially designated inspectors, on the basis of which 1.5 thousand different checks and measurements will be carried out daily. In addition, the city will begin using an innovative technology platform called Madrid iNTeligente (MiNT) - Smarter Madrid.

2013

Experts: Big Data is in fashion

Without exception, all vendors in the data management market are currently developing technologies for Big Data management. This new technological trend is also actively discussed by the professional community, both developers and industry analysts and potential consumers of such solutions.

As Datasift found, by January 2013 the wave of discussion around "big data" had exceeded all imaginable dimensions. After analyzing the number of Big Data mentions on social networks, Datasift calculated that in 2012 the term was used about 2 billion times in posts created by about 1 million different authors worldwide. That is equivalent to 260 posts per hour, with a peak of 3,070 mentions per hour.

Gartner: Every second CIO is ready to spend money on Big data

After several years of experimentation with Big data technologies and the first implementations in 2013, the adaptation of such solutions will increase significantly, Gartner predicts. Researchers surveyed IT leaders around the world and found that 42% of respondents have already invested in Big data technologies or plan to make such investments within the next year (data as of March 2013).

Companies are forced to spend money on big data processing technologies because the information landscape is changing rapidly, demanding new approaches to information processing. Many companies have already realized that large amounts of data are critical, and that working with them yields benefits unavailable through traditional sources of information and methods of processing. The constant discussion of "big data" in the media further fuels interest in the relevant technologies.

Frank Buytendijk, a vice president at Gartner, even urged companies to temper their concerns, since some worry they are falling behind competitors in adopting Big Data.

“There is no need to worry; the possibilities for implementing ideas based on big data technologies are virtually endless,” he said.

Gartner predicts that by 2015, 20% of Global 1000 companies will have a strategic focus on “information infrastructure.”

In anticipation of the new opportunities that big data processing technologies will bring, many organizations are already organizing the process of collecting and storing various types of information.

For educational and government organizations, as well as industrial companies, the greatest potential for business transformation lies in combining accumulated data with so-called dark data, the latter including email messages, multimedia and other similar content. According to Gartner, the winners in the data race will be those who learn to handle the most diverse sources of information.

Cisco survey: Big Data will help increase IT budgets

The Spring 2013 Cisco Connected World Technology Report, conducted in 18 countries by the independent research firm InsightExpress, surveyed 1,800 college students and an equal number of young professionals aged 18 to 30. The survey aimed to determine how ready IT departments are to implement Big Data projects and to gain insight into the challenges involved, the technological shortcomings, and the strategic value of such projects.

Most companies collect, record and analyze data. However, the report says, many companies face a range of complex business and information technology challenges with Big Data. For example, 60 percent of respondents admit that Big Data solutions can improve decision-making processes and increase competitiveness, but only 28 percent said that they are already receiving real strategic benefits from the accumulated information.

More than half of the IT executives surveyed believe that Big Data projects will help increase IT budgets in their organizations, as there will be increased demands on technology, personnel and professional skills. At the same time, more than half of respondents expect that such projects will increase IT budgets in their companies as early as 2012. 57 percent are confident that Big Data will increase their budgets over the next three years.

81 percent of respondents said that all (or at least some) Big Data projects will require the use of cloud computing. Thus, the spread of cloud technologies may affect the adoption rate of Big Data solutions and their business value.

Companies collect and use many different types of data, both structured and unstructured. Here are the sources from which survey participants receive their data (Cisco Connected World Technology Report):

Nearly half (48 percent) of IT leaders predict the load on their networks will double over the next two years. (This is especially true in China, where 68 percent of respondents share this view, and in Germany – 60 percent). 23 percent of respondents expect network load to triple over the next two years. At the same time, only 40 percent of respondents declared their readiness for explosive growth in network traffic volumes.

27 percent of respondents admitted that they need better IT policies and information security measures.

21 percent need more bandwidth.

Big Data opens up new opportunities for IT departments to add value and build strong relationships with business units, allowing them to increase revenue and strengthen the company's financial position. Big Data projects make IT departments a strategic partner to business departments.

According to 73 percent of respondents, the IT department will become the main driver of the implementation of the Big Data strategy. At the same time, respondents believe that other departments will also be involved in the implementation of this strategy. First of all, this concerns the departments of finance (named by 24 percent of respondents), research and development (20 percent), operations (20 percent), engineering (19 percent), as well as marketing (15 percent) and sales (14 percent).

Gartner: Millions of new jobs needed to manage big data

Global IT spending will reach $3.7 trillion in 2013, 3.8% more than information technology spending in 2012 (the year-end forecast is $3.6 trillion). The big data segment will develop at a much faster pace, says a Gartner report.

By 2015, 4.4 million IT jobs will be created worldwide to service big data, of which 1.9 million will be in the United States. Moreover, each such job will entail the creation of three additional jobs outside the IT sector, so that in the United States alone, 6 million people will be working to support the information economy within the next four years.

According to Gartner experts, the main problem is a shortage of talent: both the private and the public education systems, in the USA for example, are unable to supply the industry with enough qualified personnel. As a result, only one in three of the new IT jobs mentioned will be filled.

Analysts believe that the role of nurturing qualified IT personnel should be taken directly by companies that urgently need them, since such employees will be their ticket to the new information economy of the future.

2012

The first skepticism regarding "Big Data"

Analysts at Ovum and Gartner suggest that for big data, the fashionable topic of 2012, the time may have come to shed illusions.

The term "Big Data" at this time typically refers to the ever-growing volume of information flowing online from social media, sensor networks and other sources, as well as to the growing range of tools used to process that data and identify business-relevant trends in it.

"Because of (or despite) the hype around the idea of big data, manufacturers in 2012 looked at this trend with great hope," said Tony Baer, an analyst at Ovum.

Baer reported that DataSift had conducted a retrospective analysis of big data mentions on the Internet.